Computer Sciences Department, University of Wisconsin-Madison


Abstract 


We present Nameless Writes, a new device interface that 
removes the need for indirection in modern solid-state 
storage devices (SSDs). Nameless writes allow the de- 
vice to choose the location of a write; only then is the 
client informed of the name (i.e., address) where the block
now resides. Doing so allows the device to control block- 
allocation decisions, thus enabling it to execute critical 
tasks such as garbage collection and wear leveling, while 
removing the need for large and costly indirection tables. 
We demonstrate the effectiveness of nameless writes by 
porting the Linux ext3 file system to use an emulated 
nameless-writing device and show that doing so reduces
both space and time overheads, thus making for simpler,
less costly, and higher-performance SSD-based storage. 


1 Introduction 


Indirection is a core technique in computer systems [28]. 
Whether in the mapping of file names to blocks, or a vir- 
tual address space to an underlying physical one, system 
designers have applied indirection to improve system per- 
formance, reliability, and capacity for many years. 

For example, modern hard disk drives use a modest 
amount of indirection to improve reliability by hiding un- 
derlying write failures. When a write to a particular physi- 
cal block fails, a hard disk will remap the block to another 
location on the drive and record the mapping such that fu- 
ture reads will receive the correct data. In this manner, a 
drive transparently improves reliability without requiring 
any changes to the client above. 

Indirection is particularly important in the new class of 
flash-based storage commonly referred to as Solid State 
Devices (SSDs). In modern SSDs, an indirection map in 
the Flash Translation Layer (FTL) enables the device to 
map writes in its virtual address space to any underlying 
physical location [11, 14, 16, 19, 21, 22].

FTLs use indirection for two reasons: first, to trans- 
form the erase/program cycle mandated by flash into the 
more typical write-based interface via copy-on-write tech- 
niques, and second, to implement wear leveling [18, 20], 
which is critical to increasing SSD lifetime. Because a 
flash block becomes unusable after a certain number of 
erase-program cycles (10,000 or 100,000 cycles accord- 
ing to manufacturers [8, 15]), such indirection is needed 
to spread the write load across flash blocks evenly and 
thus ensure that no particularly popular block causes the 
device to fail prematurely. 


Unfortunately, the indirection found in many
FTLs comes at a high price, which manifests as perfor- 
mance costs, space overheads, or both. If the FTL can 
flexibly map each virtual page in its address space (as- 
suming a typical page size of 2 KB), an incredibly large 
indirection table is required. For example, a 1-TB SSD 
would need 2 GB of table space simply to keep one 32-bit 
pointer per 2-KB page of the device. Clearly, a completely 
flexible mapping is too costly; putting vast quantities of 
memory (usually SRAM) into an SSD is prohibitive. 

Because of this high cost, most SSDs do not offer a 
fully flexible per-page mapping. A simple approach pro- 
vides only a pointer per block of the SSD (a block typ- 
ically contains 64 or 128 2-KB pages), which reduces 
overheads by the ratio of block size to page size. The 
1-TB drive would now only need 32 MB of table space, 
which is more reasonable. However, as clearly articulated 
by Gupta et al. [16], block-level mappings have high per- 
formance costs due to excessive garbage collection. 
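
The table sizes quoted above follow from simple arithmetic. The sketch below reproduces them under stated assumptions: binary units (1 TB = 2^40 bytes), one 4-byte pointer per mapping unit, and 64 pages per block. The function name is illustrative and does not come from any FTL implementation.

```python
def mapping_table_bytes(capacity_bytes, unit_bytes, pointer_bytes=4):
    """Size of a flat indirection table holding one pointer per mapping unit."""
    return (capacity_bytes // unit_bytes) * pointer_bytes


TB = 2**40
KB = 2**10
PAGE = 2 * KB          # 2-KB flash page
BLOCK = 64 * PAGE      # assumption: 64 pages per flash block

page_level = mapping_table_bytes(1 * TB, PAGE)    # one pointer per page
block_level = mapping_table_bytes(1 * TB, BLOCK)  # one pointer per block

print(page_level // 2**30, "GB")   # prints "2 GB"
print(block_level // 2**20, "MB")  # prints "32 MB"
```

With 128 pages per block instead of 64, the same calculation gives a 16-MB block-level table.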

As a result, the majority of FTLs today are built us- 
ing a hybrid approach, mapping most data at block level 
and keeping a small page-mapped area for updates [11, 
21, 22]. Hybrid approaches keep space overheads low 
while avoiding the high overheads of garbage collection, 
at the cost of additional device complexity. Unfortunately, 
garbage collection can still be costly, reducing the per- 
formance of the SSD, sometimes quite noticeably [16]. 
Regardless of the approach, FTL indirection incurs a sig- 
nificant cost; as SSDs scale, even hybrid schemes mostly 
based on block pointers will become infeasible. 

In this paper, we introduce nameless writes, an ap- 
proach that removes most of the costs of indirection in 
flash-based SSDs while still retaining its benefits. Our ap- 
proach is a specific instance of de-indirection, in which an 
extra layer of indirection is removed. Unlike most writes, 
which specify both the data to write and a name
(usually in the form of a logical address), a nameless write 
simply passes the data to the device. The device is free to 
choose any underlying physical block for the data; after 
the device names the block (i.e., decides where to write
it), it informs the client of its choice. The client then can 
record the name for future reads. 

One potential problem with nameless writes is the re- 
cursive update problem: if all writes are nameless, then 
any update to the file system requires a recursive set of up- 
dates up the file-system tree. To circumvent this problem, 
we introduce a segmented address space, which consists 
of a (large) physical address space for nameless writes,
and a (small) virtual address space for traditional named 
writes. A file system running atop a nameless SSD can 
keep pointer-based structures in the virtual space; updates 
to those structures do not necessitate further updates up 
the tree, thus breaking the recursion. 


Nameless writes offer a great advantage over traditional
writes, as they largely remove the need for indirection.
Instead of pretending that the device can receive writes at
any frequency to any block, a device that supports name-
less writes is free to assign any physical page to a write
when it is written; by returning the true name (i.e., the
physical address) of the page to the client above (e.g., the 
file system), indirection is largely avoided, reducing the 
monetary cost of the SSD, improving its performance, and 
simplifying its internal structure. 


Nameless writes (largely) remove the costs of indirec- 
tion without giving away the primary responsibility an 
SSD manufacturer maintains: wear leveling. If an SSD 
simply exports the physical address space to clients, a 
simplistic file system or workload could cause the de- 
vice to fail rather rapidly, simply by over-writing the same 
block repeatedly (whether by design or simply through a 
file-system bug). With nameless writes, no such failure 
mode exists. Because the device retains control of nam- 
ing, it retains control of block placement, and thus can 
properly implement wear leveling to ensure a lengthy de- 
vice lifetime. We believe that any solution that does not 
have this property is not viable, as no manufacturer would 
like to be so easily exposed to failure. 


We demonstrate the benefits of nameless writes by port- 
ing the Linux ext3 file system to use a nameless SSD. 
Through extensive analysis on an emulated nameless SSD 
and comparison with different FTLs, we show the bene- 
fits of the new interface, in both reducing the space costs 
of indirection and improving random-write performance. 
Overall, we find that a nameless SSD uses a much smaller 
fraction of memory for indirection than a hybrid SSD 
while improving performance by an order of magnitude 
for some workloads. 


The rest of this paper is organized as follows. In Sec- 
tion 2, we discuss the costs and benefits of indirection, 
and in Section 3 we present the nameless write interface. 
In Section 4, we show how to build a nameless-writing 
device. In Section 5, we describe how to port the Linux 
ext3 file system to use the nameless-writing interface, and 
in Section 6, we evaluate nameless writes through experi- 
mentation atop an emulated nameless-writing device. We 
discuss several related works in Section 7. Finally, in Sec- 
tion 8, we conclude and discuss our future work. 
2 Indirection


It is said that “all problems in computer science can be
solved by another level of indirection,” a quote that is
often attributed to Butler Lampson. Lampson, however,
gives credit for this wisdom to David Wheeler, who not
only uttered these famous words, but also usually added
“...but that usually will create another problem” [28].

Indirection is a fundamental technique in computer sys- 
tems. Before delving into the details of nameless writes, 
we first present a discussion of some of the general prob- 
lems and solutions in systems that use indirection. First, 
we discuss why many systems utilize multiple levels of 
indirection, a problem we term excess indirection. We 
then describe the general solution to said problem, de- 
indirection, which removes an extra layer of indirection 
to improve performance or reduce space overheads. 


2.1 Excess Indirection 

Excess indirection exists in many systems that are widely 
used today, as well as in research prototypes. We now dis- 
cuss four prominent examples: OS virtual memory run- 
ning atop a hypervisor, a file system running atop a single 
disk, a file system atop a RAID array, and the focus of our 
work, file systems atop flash-based SSDs. 

An excellent example of excess indirection arises in 
memory management of operating systems running atop 
hypervisors [9]. The OS manages virtual-to-physical 
mappings for each process that is running; the hypervi- 
sor, in turn, manages physical-to-machine mappings for 
each OS. In this manner, the hypervisor has full control 
over the memory of the system, whereas the OS above 
remains unchanged, blissfully unaware that it is not man- 
aging a real physical memory. Excess indirection leads 
to both space and time overheads in virtualized systems. 
The space overhead comes from maintaining a mapping from
OS physical addresses to machine addresses for each page
and from possible additional space overhead [1]. Time 
overheads exist as well in cases like the MIPS TLB-miss 
lookup in Disco [9]. 

Indirection also exists in modern disks. For example, 
modern disks maintain a small amount of extra indirec- 
tion that maps bad sectors to nearby locations, in order to 
improve reliability in the face of write failures. Other ex- 
amples include ideas for “smart” disks that remap writes 
in order to improve performance (for example, by writing 
to the nearest free location), which have been explored 
in previous research such as Loge [13] and “intelligent” 
disks [30]. These smart disks require large indirection 
tables inside the drive to map the logical address of the 
write to its current physical location. This requirement in- 
troduces new reliability challenges, including how to keep 
the indirection table persistent. Finally, fragmentation of 
randomly-updated files is also an issue. 

File systems running atop modern RAID storage arrays
provide another excellent example of excess indirection.
Modern RAIDs often require indirection tables
for fully-flexible control over the on-disk locations of 
blocks. In AutoRAID, a level of indirection allows the 
system to keep active blocks in mirrored storage for per- 
formance reasons, and move inactive blocks to RAID to 
increase effective capacity [32] and overcome the RAID 
small-update problem [26]. When a file system runs atop 
a RAID, excess indirection exists because the file sys- 
tem maps logical offsets to logical block addresses. The 
RAID, in turn, maps logical block addresses to physical 
(disk, offset) pairs. Such systems incur memory space
overhead to maintain these tables and must meet the challenge
of persisting the tables across power loss.

The focus of our work is flash-based SSDs, and thus it 
is no surprise that these too exhibit excess indirection. The 
extra level of indirection is provided via the Flash Trans- 
lation Layer (FTL). The FTL is needed for two primary 
reasons. First, it is used to transform reads and writes 
issued by the client into reads and erase/program cycles 
supported by actual flash chips. In particular, because of 
the high cost of block erases (required before program- 
ming a page within the block), FTLs map current write 
activity to a small set of active blocks in a log-structured 
fashion, thus amortizing the cost of erases. Second, the 
FTL enables the SSD to implement wear leveling. Re- 
peatedly erasing and programming a particular block will 
render it unreadable; thus, SSDs use the indirection pro- 
vided by the FTL to spread write load across blocks and 
thus ensure that the device has a longer lifetime. 


2.2 De-indirection 


Because of these costs, system designers have long sought 
methods and techniques to reduce the costs of excess indi- 
rection in various systems. We label the removal of excess 
indirection de-indirection. 

The basic idea is simple. Let us imagine a system with 
two levels of mapping, and thus excess indirection. The 
first indirection $F$ maps items in the $A$ space to items
in the $B$ space: $F(A_i) \rightarrow B_j$. The second indirection
$G$ maps items in the $B$ space to those in the $C$ space:
$G(B_j) \rightarrow C_k$. To look up item $i$, one performs the
following “excessive” indirection: $G(F(i))$.

De-indirection removes the second level of indirection
by evaluating the second mapping $G()$ for all values
mapped by $F()$: $\forall i : F(i) \leftarrow G(F(i))$. Thus, the
top-level mapping simply extracts the needed values from the
lower-level indirection and installs them directly.
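
As a minimal illustration of this idea, the sketch below models the two mappings as Python dictionaries and collapses them into one. The helper name `de_indirect` and the sample keys are hypothetical and chosen only for illustration.

```python
def de_indirect(f, g):
    """Collapse two mappings into one: for every i, F(i) <- G(F(i))."""
    return {key: g[mid] for key, mid in f.items()}


# F maps A-space items to B-space items; G maps B-space items to C-space items.
F = {"a0": "b2", "a1": "b0", "a2": "b1"}
G = {"b0": "c7", "b1": "c3", "b2": "c5"}

# With excess indirection, every lookup walks both tables.
two_level = G[F["a0"]]

# After de-indirection, the top-level mapping holds C-space values directly,
# so the second table is no longer consulted on lookups.
F_flat = de_indirect(F, G)
one_level = F_flat["a0"]
```

Both lookups return the same C-space item; the difference is that the flattened mapping needs one table access instead of two.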

De-indirection has been successfully applied in a 
few domains, most notably within hypervisors. The 
Turtles project [7] provides an excellent example: in 
a recursively-virtualized environment (with hypervisors 
running on hypervisors), the Turtles system installs what 
the authors refer to as multi-dimensional page tables. 


Their approach essentially collapses multiple page tables 
into a single extra level of indirection, and thus reduces 
space and time overheads, making the costs of recursive 
virtualization more palatable. 


2.3 Summary 

Excess indirection is common across virtual memory and 
storage systems. In some cases, such as with hypervisor- 
based memory virtualization, it is required for function- 
ality; each OS believes it owns the same physical mem- 
ory, and thus cannot share it without the indirection pro- 
vided by the hypervisor. In other cases, it improves perfor- 
mance, as we observed with disk systems and SSDs. An- 
other reason for indirection is modularity and code sim- 
plicity. Finally, reliability is often the reason for excess 
indirection, notably within a single disk to handle write 
failures and within an SSD to perform wear leveling. 

In all cases, at least part of the reason for excess indi- 
rection is the need to keep a fixed interface between higher 
and lower layers of the system. Without such a constraint, 
one could often remove the excess indirection and thus 
improve the system. For example, if an OS running on a 
para-virtualized system [31] is modified to request a ma- 
chine page from the hypervisor and then install the correct 
virtual-to-machine page translation in its page tables, the 
hypervisor is relieved of having to manage this extra level 
of indirection, thus improving performance and reducing 
space overheads. 


3 Nameless Writes 


In this section, we discuss a new device interface that en- 
ables flash-based SSDs to remove a great deal of their in- 
frastructure for indirection. We call a device that supports 
this interface a Nameless-writing Device. Table 1 summa- 
rizes the nameless-writing device interface. 

The key feature of a nameless-writing device is its 
ability to perform nameless writes; however, to help
clients (such as file systems) use a nameless-writing
device, a number of other features are useful as well. In
particular, the nameless-writing device should provide support
for a segmented address space, migration callbacks,
and associated metadata. We discuss these features in this 
section and how a prototypical file system could use them. 


3.1 Nameless Write Interfaces 

We first present the basic device interfaces of Nameless 
Writes: nameless (new) write, nameless overwrite, physi- 
cal read, and free. 

The nameless write interface completely replaces the
existing write operation. A nameless write differs from a
traditional write in two important ways. First, a nameless
write does not specify a target address (i.e., a name); this
allows the device to select the physical location without
control from the client above. Second, after the device
writes the data, it returns a physical address (i.e., a name)
and status to the client, which then keeps the name in its
own structure for future reads.


Virtual Read
    down: virtual address, length
    up:   status, data

Virtual Write
    down: virtual address, data, length
    up:   status

Nameless Write
    down: data, length, metadata
    up:   status, resulting physical address(es)

Nameless Overwrite
    down: old physical address(es), data, length, metadata
    up:   status, resulting physical address(es)

Physical Read
    down: physical address, length, metadata
    up:   status, data

Free
    down: virtual/physical addr, length, metadata, flag
    up:   status

Migration [Callback]
    up:   old physical addr, new physical addr, metadata
    down: old physical addr, new physical addr, metadata

Table 1: The Nameless-Writing Device Interfaces. The
table presents the nameless-writing device interfaces.


The nameless overwrite interface is similar to the
nameless (new) write interface, except that it also passes 
the old physical address(es) to the device. The device 
frees the data at the old physical address(es) and then per- 
forms a nameless write. 


Read operations are mostly unchanged; as usual, they 
take as input the physical address to be read and return 
the data at that address and a status indicator. A slight 
change of the read interface is the addition of metadata in 
the input, for reasons that will be described in Section 3.4. 


Because a nameless write is an allocating operation, a
nameless-writing device also needs to be informed of
deallocation. Most SSDs refer to this interface as
the free or trim command. Once a block has been freed 
(trimmed), the device is free to re-use it. 
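
The four basic operations can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions: the class name, status strings, and free-list policy are hypothetical, pages are treated as individually reusable, and flash erase constraints are ignored.

```python
class ToyNamelessDevice:
    """Hypothetical in-memory model of the physical segment of a device."""

    def __init__(self, num_pages):
        self.pages = [None] * num_pages          # physical page -> (data, metadata)
        self.free_pages = list(range(num_pages))

    def nameless_write(self, data, metadata=None):
        """The device picks the location and returns (status, physical address)."""
        if not self.free_pages:
            return "ENOSPC", None
        addr = self.free_pages.pop(0)
        self.pages[addr] = (data, metadata)
        return "OK", addr

    def nameless_overwrite(self, old_addr, data, metadata=None):
        """Free the old location, then perform a nameless write."""
        self.free(old_addr)
        return self.nameless_write(data, metadata)

    def physical_read(self, addr):
        """Return (status, data) for a physical address named by the device."""
        entry = self.pages[addr]
        if entry is None:
            return "EINVAL", None
        return "OK", entry[0]

    def free(self, addr):
        """Deallocate (trim) a page so that the device may reuse it."""
        if self.pages[addr] is not None:
            self.pages[addr] = None
            self.free_pages.append(addr)
        return "OK"


dev = ToyNamelessDevice(num_pages=4)
meta = {"inode": 12, "offset": 0}
status, addr = dev.nameless_write(b"hello", metadata=meta)
status2, new_addr = dev.nameless_overwrite(addr, b"world", metadata=meta)
```

Note that the client never chooses `addr` or `new_addr`; it only records the names the device returns and uses them for later reads.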


Finally, we consider how the nameless write interface 
could be utilized by a typical file-system client such as 
Linux ext3. For illustration, we examine the operations to 
append a new block to an existing file. First, the file sys- 
tem issues a nameless write of the newly-appended data 
block to a nameless-writing device. When the nameless 
write completes, the file system is informed of its address 
and can update the corresponding in-memory inode for 
this file so that it refers to the physical address of this 
block. Since the inode has been changed, the file sys- 
tem will eventually flush it to the disk as well; the inode 
must be written to the device with another nameless write. 


FAST '12: 10th USENIX Conference on File and Storage Technologies


Again, the file system waits for the inode to be written and 
then updates any structures containing a reference to the 
inode. If nameless writes are the only interface available 
for writing to the storage device, then this recursion will 
continue until a root structure is reached. For file sys- 
tems that do not perform this chain of updates or enforce 
such ordering, such as Linux ext2, additional ordering and 
writes are needed. This problem of recursive update has 
been solved in other systems by adding a level of indirec- 
tion (e.g., the inode map in LFS [27]). 


3.2 Segmented Address Space 


To solve the recursive update problem without requiring 
substantial changes to the existing file system, we intro- 
duce a segmented address space with two segments (see 
Figure 1): the virtual address space, which uses virtual 
read, write and free interfaces, and the physical address 
space, which uses nameless read, write, overwrite, and
free interfaces. 


The virtual segment presents an address space from 
blocks 0 through V - 1, and is a virtual block space of
size V blocks. The device virtualizes this address space, 
and thus keeps a (small) indirection table to map accesses 
to the virtual space to the correct underlying physical lo- 
cations. Reads and writes to the virtual space are identical 
to reads and writes on typical devices. The client sends 
an address and a length (and, if a write, data) down to the 
device; the device replies with a status message (success 
or failure), and if a successful read, the requested data. 


The nameless segment presents an address space from 
blocks 0 through P - 1, and is a physical block space of
size P blocks. The bulk of the blocks in the device are 
found in this physical space, which allows typical named 
reads; however, all writes to physical space are nameless, 
thus preventing the client from directly writing to physical 
locations of its choice. 


We use a virtual/physical flag to indicate the segment a 
block is in and the proper interface it should go through. 
The sizes of the two segments are not fixed. Allocation in
either segment can be performed while there is still space 
on the device. A device space usage counter can be main- 
tained for this purpose. 


The reason for the segmented address space is to en- 
able file systems to largely reduce the levels of recursive 
updates that would occur with only nameless writes. File 
systems such as ext2 and ext3 can be designed such that 
inodes and other metadata are placed in the virtual ad- 
dress space. Such file systems can simply issue a write 
to an inode and complete the update without needing to 
modify directory structures that reference the inode. Thus, 
the segmented address space allows updates to complete 
without propagating throughout the directory hierarchy. 
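
To make the effect concrete, the following sketch models a device with both segments and appends a block to a file whose inode lives in the virtual segment. All names here (`ToySegmentedDevice`, `INODE_VADDR`, the dictionary-based inode) are hypothetical, and invalidation of stale pages is omitted for brevity.

```python
class ToySegmentedDevice:
    """Hypothetical model: a small virtual segment plus a nameless physical segment."""

    def __init__(self, num_physical):
        self.flash = {}                  # physical address -> data
        self.next_free = 0               # log-style allocation pointer
        self.num_physical = num_physical
        self.vmap = {}                   # virtual address -> physical address

    def _allocate(self):
        if self.next_free >= self.num_physical:
            raise RuntimeError("device full")
        addr = self.next_free
        self.next_free += 1
        return addr

    def nameless_write(self, data):
        addr = self._allocate()
        self.flash[addr] = data
        return addr                      # the device-chosen name

    def physical_read(self, addr):
        return self.flash[addr]

    def virtual_write(self, vaddr, data):
        # The device remaps the virtual block; the client's name never changes.
        self.vmap[vaddr] = self._allocate()
        self.flash[self.vmap[vaddr]] = data

    def virtual_read(self, vaddr):
        return self.flash[self.vmap[vaddr]]


dev = ToySegmentedDevice(num_physical=16)
INODE_VADDR = 5                          # the inode lives in the virtual segment
directory = {"file.txt": INODE_VADDR}    # the directory refers to the virtual name

dev.virtual_write(INODE_VADDR, {"blocks": []})

# Append a block: one nameless write, then an update of the inode only.
data_addr = dev.nameless_write(b"new data")
inode = dict(dev.virtual_read(INODE_VADDR))
inode["blocks"] = inode["blocks"] + [data_addr]
dev.virtual_write(INODE_VADDR, inode)
```

The directory entry is untouched by the append, because the inode keeps its virtual address even though the device has moved it to a new physical page.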
Figure 1: The Segmented Address Space. A nameless-writing
device provides a segmented address space to clients.
The smaller virtual space allows normal reads and writes, which
the device in turn maps to underlying physical locations. The
larger physical space allows reads to physical addresses, but
only nameless writes. In the example, only two blocks of the
virtual space are currently mapped, V0 and V2, to physical blocks
P2 and P3, respectively.





3.3 Migration Callback


Several kinds of devices such as flash-based SSDs need to 
migrate data for reasons like wear leveling. We propose 
the migration callback interface to support such needs. 

A typical flash-based SSD performs wear leveling via 
indirection: it simply moves the physical blocks and up- 
dates the map. With nameless writes, blocks in the phys- 
ical segment cannot be moved without informing the file 
system. To allow the nameless-writing device to move 
data for wear leveling, a nameless-writing device uses mi- 
gration callbacks to inform the file system of the physical 
address change of a block. The file system then updates 
any metadata pointing to this migrated block. 
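
A migration callback might look roughly like the sketch below. It assumes that the metadata passed with the callback identifies the owning inode and the block's offset within it; the class and method names are hypothetical and do not reflect the actual ext3 port.

```python
class ToyFileSystem:
    """Hypothetical client that tracks block pointers per inode."""

    def __init__(self):
        self.inodes = {}   # inode number -> list of physical block addresses

    def on_migration(self, old_addr, new_addr, metadata):
        """Migration callback: repoint the owning inode at the new address."""
        blocks = self.inodes[metadata["inode"]]
        offset = metadata["offset"]
        if blocks[offset] == old_addr:   # ignore stale callbacks
            blocks[offset] = new_addr
            return True
        return False


class ToyDevice:
    """Hypothetical device that moves pages and notifies the client."""

    def __init__(self, callback):
        self.pages = {}    # physical address -> (data, metadata)
        self.callback = callback

    def migrate(self, old_addr, new_addr):
        """Move a live page (e.g., for wear leveling) and inform the client."""
        data, metadata = self.pages.pop(old_addr)
        self.pages[new_addr] = (data, metadata)
        return self.callback(old_addr, new_addr, metadata)


fs = ToyFileSystem()
dev = ToyDevice(fs.on_migration)
dev.pages[7] = (b"payload", {"inode": 3, "offset": 0})
fs.inodes[3] = [7]

updated = dev.migrate(old_addr=7, new_addr=42)
```

After the call, the inode points at the new physical address, so later reads issued by the file system reach the migrated data.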


3.4 Associated Metadata 


The final interface of a nameless-writing device is used to 
enable the client to quickly locate metadata structures that 
point to data blocks. The complete specification for as- 
sociated metadata supports communicating metadata be- 
tween the client and device. Specifically, the nameless 
write command is extended to include a third parameter: a 
small amount of metadata, which is persistently recorded 
adjacent to the data in a per-block header. Reads and mi- 
gration callbacks are also extended to include this meta- 
data. The associated metadata is kept with each block 
buffer in the page cache as well. 

This metadata enables the client file system to readily
identify the metadata structure(s) that point to a data
block. For example, in ext3 we can locate the metadata 
structure that points to a data block by the inode number, 
the inode generation number, and the offset of the block in 
the inode. For file systems that already explicitly record 
back references, such as btrfs and NoFS [10], the back 
references can simply be reused for our purposes. 


Such metadata structure identification can be used in 
several tasks. First, when searching for a data block in the 
page cache, we obtain the metadata information and com- 
pare it against the associated metadata of the data blocks 
in the page cache. Second, the migration callback process 
uses associated metadata to find the metadata that needs to 
be updated when a data block is migrated. Finally, associ- 
ated metadata enables recovery in various crash scenarios, 
which we will discuss in detail in Section 5.7. 

One last issue worth noting is the difference between
the associated metadata and address mapping tables. Un- 
like address mapping tables, the associated metadata is 
not used to locate physical data and is only used by the 
device during migration callbacks and crash recovery. 
Therefore, it can be stored adjacent to the data on the de- 
vice. Only a small amount of the associated metadata is 
fetched into device cache for a short period of time dur- 
ing migration callbacks or recovery. Therefore, the space 
cost of associated metadata is much smaller than that of
address mapping tables.


3.5 Implementation Issues 


We now discuss various implementation issues that arise
in the construction of a nameless-writing device. We focus
on the issues that differ from those of a standard SSD; the
latter are covered in detail elsewhere [16].

A number of issues revolve around the virtual segment. 
Most importantly, how big should such a segment be? Un- 
fortunately, its size depends heavily on how the client uses 
it, as we will see when we port Linux ext3 to use nameless 
writes in Section 5. Our results in Section 6 show that a 
small virtual segment is usually sufficient. 

The virtual space, by definition, requires an in-memory 
indirection table. Fortunately, this table is quite small, 
likely including simple page-level mappings for each page 
in the virtual segment. However, the virtual address space 
could be made larger than the size of the table; in this 
case, the device would have to swap pieces of the page 
table to and from the device, slowing down access to the 
virtual segment. Thus, while putting many data structures 
into the virtual space is possible, ideally the client should 
be miserly with the virtual segment, in order to avoid ex- 
ceeding the supporting physical resources. 

Another concern is the extra level of information natu- 
rally exported by exposing physical names to clients. Al- 
though the value of physical names has been extolled by 
others [12], a device manufacturer may feel that such in- 
formation reveals too much of their “secret sauce” and 
thus be wary of adopting such an interface. We believe 
that if such a concern exists, the device could hand out 
modified forms of the true physical addresses, thus trying 
to hide the exact addresses from clients. Doing so may ex- 
act additional performance and space overheads, perhaps 
the cost of hiding information from clients. 
4 Nameless-Writing Device


In this section, we describe our implementation of an 
emulated nameless-writing SSD. With nameless writes, 
a nameless-writing SSD can have a simpler FTL, which 
has the freedom to do its own allocation and wear level- 
ing. We first discuss how we implement the nameless- 
writing interfaces and then propose a new garbage collec- 
tion method that avoids file-system interaction. We defer 
the discussion of wear leveling to Section 5.6. 


4.1 Nameless-Writing Interface Support 
We implemented an emulated nameless-writing SSD that 
performs data allocation in a log-structured fashion by 
maintaining active blocks that are written in sequential or- 
der. When a nameless write is received, the device allo- 
cates the next free physical address, writes the data, and 
returns the physical address to the file system. 

To support the virtual block space, the nameless- 
writing device maintains a mapping table between logi- 
cal and physical addresses in its device cache. When the 
cache is full, the mapping table is swapped out to the flash 
storage of the SSD. As our results show in Section 6.1, the 
mapping table size of typical file system images is small; 
thus, such swapping rarely happens in practice. 

The nameless-writing device handles trims in a man- 
ner similar to traditional SSDs; it invalidates the physical 
address sent by a trim command. During garbage collec- 
tion, invalidated pages can be recycled. The device also 
invalidates the old physical addresses of overwrites. 

A nameless-writing device needs to keep certain asso- 
ciated metadata for nameless writes. We choose to store 
the associated metadata of a data page in its Out-Of-Band 
(OOB) area. The associated metadata is moved together 
with data pages when the device performs a migration. 


4.2 In-place Garbage Collection 
In this section, we describe a new garbage collection 
method for nameless-writing devices. Traditional FTLs 
perform garbage collection on a flash block by reclaim- 
ing its invalid data pages and migrating its live data pages 
to new locations. Such garbage collection requires a 
nameless-writing device to inform the file system of the 
new physical addresses of the migrated live data; the file 
system then needs to update and write out its metadata. To 
avoid the costs of such callbacks and additional metadata 
writes, we propose in-place garbage collection, which 
writes the live data back to the same location instead of 
migrating it. A similar hole-plugging approach was pro- 
posed in earlier work [24], where live data is used to plug 
the holes of the most utilized segments.

To perform in-place garbage collection, the FTL selects 
a candidate block using a certain policy. The FTL reads 
all live pages from the chosen block together with their 
associated metadata, stores them temporarily in a
super-capacitor- or battery-backed cache, and then erases the
block. The FTL next writes the live pages to their orig- 
inal addresses and tries to fill the rest of the block with 
writes in the waiting queue of the device. Since a flash 
block can only be written in one direction, when there are 
no waiting writes to fill the block, the FTL marks the free 
space in the block as unusable. We call such space wasted 
space. During in-place garbage collection, the physical 
addresses of live data are not changed. Thus, no file sys- 
tem involvement is needed. 
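
The procedure described above can be sketched for a single block as follows. This is a simplified model, not the emulator's code: pages are tagged with hypothetical state strings, the backed cache is an ordinary dictionary, and the waiting queue is a plain list.

```python
def in_place_gc(block, waiting_writes):
    """Toy in-place garbage collection of one flash block.

    block: list of (state, data) pages, state in {"live", "invalid", "empty"}.
    waiting_writes: data items queued at the device, oldest first.
    Returns (new_block, filled_offsets, remaining_writes).
    """
    # Read live pages (keyed by offset) into a buffer, then "erase" the block.
    live = {i: data for i, (state, data) in enumerate(block) if state == "live"}
    last_live = max(live) if live else -1
    new_block = [("empty", None)] * len(block)
    queue = list(waiting_writes)
    filled = []

    # Program pages in ascending order, since a block is written in one direction.
    for i in range(len(block)):
        if i in live:
            new_block[i] = ("live", live[i])        # same physical address as before
        elif queue:
            new_block[i] = ("live", queue.pop(0))   # plug the hole with a waiting write
            filled.append(i)
        elif i < last_live:
            new_block[i] = ("wasted", None)         # skipped to reach a later live page
        # Otherwise the page stays empty and can be programmed later.
    return new_block, filled, queue


block = [("live", "A"), ("invalid", None), ("invalid", None),
         ("live", "B"), ("invalid", None), ("invalid", None)]
new_block, filled, remaining = in_place_gc(block, waiting_writes=["X"])
```

In this example the single waiting write plugs offset 1, offset 2 becomes wasted space because page "B" must return to offset 3, and the two trailing pages stay empty for later writes.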


Policy to choose candidate block: A natural question 
is how to choose blocks for garbage collection. A simple 
method is to pick blocks with the fewest live pages so that 
the cost of reading and writing them back is minimized. 
However, choosing such blocks may result in an excess of 
wasted space. In order to pick a good candidate block for 
in-place garbage collection, we aim to minimize the cost 
of rewriting live data and to reduce wasted space during 
garbage collection. We propose an algorithm that tries to 
maximize the benefit and minimize the cost of in-place 
garbage collection. We define the cost of garbage
collecting a block to be the total cost of erasing the block
($T_{erase}$) plus reading ($T_{page\_read}$) and writing
($T_{page\_write}$) the live data ($N_{valid}$) in the block.


$cost = T_{erase} + (T_{page\_read} + T_{page\_write}) * N_{valid}$


We define benefit as the number of new pages that can
potentially be written in the block. Benefit includes the
following items: the current number of waiting writes in
the device queue ($N_{wait\_write}$), which can be filled into
empty pages immediately, the number of empty pages
at the end of a block ($N_{last}$), which can be filled at a
later time, and an estimated number of future writes based
on the speed of incoming writes ($S_{write}$). While writing
valid pages ($N_{valid}$) and waiting writes ($N_{wait\_write}$),
new writes will be accumulated in the device queue. We
account for these new incoming writes by
$T_{page\_write} * (N_{valid} + N_{wait\_write}) * S_{write}$.
Since we can never write
more than the amount of the recycled space (i.e., number
of invalid pages, $N_{invalid}$) of a block, the benefit function
uses the minimum of the number of invalid pages and the
number of all potential new writes.


$benefit = \min(N_{invalid},\ N_{wait\_write} + N_{last} + T_{page\_write} * (N_{valid} + N_{wait\_write}) * S_{write})$


The FTL calculates the benefit/cost ratio of all blocks that 
contain invalid pages and selects the block with the maxi- 
mal ratio to be the garbage collection candidate. Compu- 
tationally less expensive algorithms could be used to find 
reasonable approximations; such an improvement is left 
to future work. 
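As a rough illustration of the selection policy, the following Python sketch evaluates the cost and benefit definitions above and picks the block with the largest ratio. The `BlockStats` type and function names are hypothetical, the latencies are the emulator's values (25 µs page read, 200 µs page write, 1500 µs block erase) expressed in seconds, and the write speed is assumed to be given in pages per second.

```python
from dataclasses import dataclass
from typing import List, Optional

T_ERASE = 1500e-6       # block erase latency (s)
T_PAGE_READ = 25e-6     # page read latency (s)
T_PAGE_WRITE = 200e-6   # page write latency (s)


@dataclass
class BlockStats:
    n_valid: int    # live pages that must be read and rewritten
    n_invalid: int  # pages reclaimed by collecting the block
    n_last: int     # empty pages at the end of the block


def gc_cost(b: BlockStats) -> float:
    """Erase time plus read and write time for every live page."""
    return T_ERASE + (T_PAGE_READ + T_PAGE_WRITE) * b.n_valid


def gc_benefit(b: BlockStats, n_wait_write: int, s_write: float) -> float:
    """New pages that could be written, capped by the reclaimed space."""
    incoming = T_PAGE_WRITE * (b.n_valid + n_wait_write) * s_write
    return min(b.n_invalid, n_wait_write + b.n_last + incoming)


def pick_candidate(blocks: List[BlockStats], n_wait_write: int,
                   s_write: float) -> Optional[BlockStats]:
    """Return the block with the maximal benefit/cost ratio, if any qualifies."""
    candidates = [b for b in blocks if b.n_invalid > 0]
    if not candidates:
        return None
    return max(candidates,
               key=lambda b: gc_benefit(b, n_wait_write, s_write) / gc_cost(b))
```

The scan is linear in the number of blocks, which is the cost the text suggests could be reduced by an approximate policy.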
5 Nameless Writes on ext3 


In this section we discuss our implementation of name- 
less writes on the Linux ext3 file system. The Linux 
ext3 file system is a classic journaling file system that is 
commonly used in many Linux distributions. It extends 
the Linux ext2 file system and uses the same allocation 
method as ext2. It provides three journaling modes: writeback 
mode, ordered mode, and journal mode. The ordered journaling 
mode of ext3 is a commonly used mode, which 
writes metadata to the journal and writes data to disk be- 
fore committing metadata of the transaction. It provides 
ordering that can be naturally used by nameless writes, 
since the nameless-writing interface requires metadata to 
reflect the physical addresses returned by data writes. When 
committing metadata in ordered mode, the physical ad- 
dresses of data blocks are known to the file system be- 
cause data blocks are written out first. Thus, we imple- 
mented nameless writes with ext3 ordered mode; other 
modes are left for future work. 


5.1 Segmented Address Space 

We first discuss physical and virtual address space separa- 
tion and modified file-system allocation on ext3. We use 
the physical address space to store all data blocks and the 
virtual address space to store all metadata structures, in- 
cluding superblocks, inodes, data and inode bitmaps, indi- 
rect blocks, directory blocks, and journal blocks. We use 
the type of a block to determine whether it is in the virtual 
or the physical address space and the type of interface it 
goes through. 

The nameless-writing file system does not perform al- 
location of the physical address space and only allocates 
metadata in the virtual address space. Therefore, we do 
not fetch or update group bitmaps for nameless block al- 
location. For these data blocks, the only bookkeeping task 
that the file system needs to perform is tracking overall de- 
vice space usage. Specifically, the file system checks for 
total free space of the device and updates the free space 
counter when a data block is allocated or de-allocated. 
Metadata blocks in the virtual address space are 
allocated in the same way as the original ext3 file system, 
thus making use of existing bitmaps. 
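The split between the two address spaces can be sketched as follows. This is an illustrative model only: `BlockType`, `Allocator`, and the single flat bitmap are hypothetical stand-ins for ext3's block groups, and the sketch assumes the free-space counter is decremented for metadata blocks as well.

```python
import errno
from enum import Enum, auto
from typing import Optional


class BlockType(Enum):
    SUPERBLOCK = auto()
    INODE = auto()
    BITMAP = auto()
    INDIRECT = auto()
    DIRECTORY = auto()
    JOURNAL = auto()
    DATA = auto()


def is_nameless(block_type: BlockType) -> bool:
    """Only data blocks live in the physical (nameless) address space."""
    return block_type is BlockType.DATA


class Allocator:
    def __init__(self, device_blocks: int, virtual_blocks: int) -> None:
        self.free_blocks = device_blocks               # overall space counter
        self.virtual_bitmap = [False] * virtual_blocks  # metadata allocation only

    def allocate(self, block_type: BlockType) -> Optional[int]:
        """Return a virtual address for metadata, or None for a nameless block."""
        if self.free_blocks == 0:
            raise OSError(errno.ENOSPC, "no space left on device")
        if is_nameless(block_type):
            self.free_blocks -= 1   # no bitmap work; the device picks the address
            return None
        try:
            addr = self.virtual_bitmap.index(False)
        except ValueError:
            raise OSError(errno.ENOSPC, "virtual address space exhausted") from None
        self.virtual_bitmap[addr] = True
        self.free_blocks -= 1
        return addr
```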


5.2 Associated Metadata 


We include the following items as associated metadata of 
a data block: 1) the inode number or the logical address 
of the indirect block that points to the data block, 2) the 
offset within the inode or the indirect block, 3) the inode 
generation number, and 4) a timestamp of when the data 
block was last updated. Items 1 to 3 are used to identify 
the metadata structure that points to a data block. Item 
4 is used during the migration callback process to update 
the metadata structure with the most up-to-date physical 
address of a data block. 


All the associated metadata is stored in the OOB area 
of a data block. The total amount of additional status 
we store in the OOB area is less than 48 bytes, smaller 
than the typical 128-byte OOB size of 4-KB flash pages. 
For reliability reasons, we assume that a data page and its 
OOB area are always written atomically. 
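To make the size argument concrete, here is one possible packing of the four items using Python's `struct` module. The field widths, the extra flag byte distinguishing an inode from an indirect block, and the name `AssociatedMetadata` are assumptions for illustration; the text only states that the total is under 48 bytes.

```python
import struct
from typing import NamedTuple

# owner (8 B), owner-is-indirect flag (1 B), offset (4 B),
# inode generation (4 B), timestamp (8 B); little-endian, no padding
OOB_FORMAT = "<QBIIQ"
OOB_SIZE = struct.calcsize(OOB_FORMAT)  # 25 bytes with this hypothetical layout


class AssociatedMetadata(NamedTuple):
    owner: int               # inode number or logical address of indirect block
    owner_is_indirect: bool  # which of the two `owner` refers to
    offset: int              # slot within the inode or indirect block
    generation: int          # inode generation number
    timestamp: int           # last update time of the data block

    def pack(self) -> bytes:
        """Serialize into the bytes stored in the page's OOB area."""
        return struct.pack(OOB_FORMAT, self.owner, int(self.owner_is_indirect),
                           self.offset, self.generation, self.timestamp)

    @classmethod
    def unpack(cls, raw: bytes) -> "AssociatedMetadata":
        """Parse the leading OOB_SIZE bytes of an OOB area."""
        owner, flag, offset, generation, timestamp = struct.unpack(
            OOB_FORMAT, raw[:OOB_SIZE])
        return cls(owner, bool(flag), offset, generation, timestamp)
```

Even with 8-byte owner and timestamp fields, this layout leaves most of a 128-byte OOB area free.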


5.3 Write 


To perform a nameless write, the file system sends the data 
and the associated metadata of the block to the device. 
When the device finishes a nameless write and sends back 
its physical address, the file system updates the inode or 
the indirect block pointing to it with the new physical ad- 
dress. It also updates the block buffer with the new physi- 
cal address. In ordered journaling mode, metadata blocks 
are always written after data blocks have been commit- 
ted; thus on-disk metadata is always consistent with its 
data. The file system performs overwrites similarly. The 
only difference is that overwrites have an existing phys- 
ical address, which is sent to the device; the device uses 
this information to invalidate the old data. 
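The write path can be modeled with a toy device that picks the address itself. Everything here is a hypothetical simplification: `NamelessDevice` allocates pages by a simple append pointer, `Inode` holds direct pointers only, and journaling is omitted.

```python
from typing import Dict, Optional, Tuple


class NamelessDevice:
    """Toy device: chooses the physical address and returns it to the caller."""

    def __init__(self, num_pages: int) -> None:
        self.num_pages = num_pages
        self.next_free = 0
        self.pages: Dict[int, Tuple[bytes, Tuple[int, int]]] = {}

    def nameless_write(self, data: bytes, metadata: Tuple[int, int],
                       old_addr: Optional[int] = None) -> int:
        if self.next_free >= self.num_pages:
            raise RuntimeError("device full; a real device would garbage-collect")
        addr = self.next_free
        self.next_free += 1
        self.pages[addr] = (data, metadata)   # data and OOB metadata together
        if old_addr is not None:
            self.pages.pop(old_addr, None)    # overwrite: invalidate the old copy
        return addr


class Inode:
    def __init__(self, number: int) -> None:
        self.number = number
        self.block_ptrs: Dict[int, int] = {}  # block offset -> physical address


def write_block(dev: NamelessDevice, inode: Inode, offset: int, data: bytes) -> int:
    """Write (or overwrite) one data block and record the returned address."""
    old_addr = inode.block_ptrs.get(offset)   # present only for overwrites
    addr = dev.nameless_write(data, (inode.number, offset), old_addr)
    inode.block_ptrs[offset] = addr           # metadata now names the new location
    return addr
```

A first write and an overwrite go through the same call; the only difference is whether `old_addr` is set.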


5.4 Read 


We change two parts of the read operation of data blocks 
in the physical address space: reading from the page cache 
and reading from the physical device. To search for a data 
block in the page cache, we compare the metadata index 
(e.g., inode number, inode generation number, and block 
offset) of the block to be read against the metadata associ- 
ated with the blocks in the page cache. If the buffer is not 
in the page cache, the file system fetches it from the de- 
vice using its physical address. The associated metadata 
of the data block is also sent with the read operation to 
enable the device to search for remapping entries during 
device wear leveling (see Section 5.6). 


5.5 Free 


The current Linux ext3 file system does not support the 
SSD trim operation. We implemented the ext3 trim oper- 
ation in a manner similar to ext4. Trim entries are created 
when the file system deletes a block (named or nameless). 
A trim entry contains the logical address of a named block 
or the physical address of a nameless block, the length of 
the block, its associated metadata, and the address space 
flag. The file system then adds the trim entry to the cur- 
rent journal transaction. At the end of transaction commit, 
all trim entries belonging to the transaction are sent to the 
device. The device locates the block to be deleted using 
the information contained in the trim operation and inval- 
idates the block. 

When a metadata block is deleted, the original ext3 de- 
allocation process is performed. When a data block is 
deleted, no de-allocation is performed (i.e., bitmaps are 
not updated); only the free space counter is updated. 
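A small sketch of how trim entries might ride along with a journal transaction follows. `TrimEntry` and `Transaction` are hypothetical names, the journal commit itself is elided, and `send_trim` stands in for whatever call delivers a trim to the device.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class TrimEntry:
    address: int               # logical (named) or physical (nameless) address
    length: int                # length of the block
    metadata: Tuple[int, ...]  # associated metadata of the block
    nameless: bool             # address space flag


class Transaction:
    def __init__(self) -> None:
        self.trim_entries: List[TrimEntry] = []

    def delete_block(self, entry: TrimEntry) -> None:
        """Queue a trim for a block deleted within this transaction."""
        self.trim_entries.append(entry)

    def commit(self, send_trim: Callable[[TrimEntry], None]) -> int:
        """After the (elided) journal commit, send all queued trims in order."""
        sent = 0
        for entry in self.trim_entries:
            send_trim(entry)
            sent += 1
        self.trim_entries.clear()
        return sent
```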
5.6 Wear Leveling with Callbacks 


When a nameless-writing device performs wear leveling, 
it migrates live data to achieve even wear of the device. 
When such migration happens with data blocks in the 
physical address space, the file system needs to be in- 
formed about the change of their physical addresses. In 
this section, we describe how the nameless-writing device 
handles data block migration and how it interacts with the 
file system to perform migration callbacks. 


When live nameless data blocks (together with their 
associated metadata in the OOB area) are migrated dur- 
ing wear leveling, the nameless-writing device creates a 
mapping from the data block’s old physical address to its 
new physical address and stores it together with its asso- 
ciated metadata in a migration mapping table in the de- 
vice cache. The migration mapping table is used to locate 
the migrated physical address of a data block for reads 
and overwrites, which may be sent to the device with the 
block’s old physical address. After the mapping has been 
added, the old physical address is reclaimed and can be 
used by future writes. 


At the end of a wear-leveling operation, the device 
sends a migration callback to the file system, which con- 
tains all migrated physical addresses and their associated 
metadata. The file system then uses the associated meta- 
data to locate the metadata pointing to the data block and 
updates it with the new physical address in a background 
process. Next, the file system writes changed metadata to 
the device. When a metadata write finishes, the file sys- 
tem deletes all the callback entries belonging to this meta- 
data block and sends a response to the device, informing 
it that the migration callback has been processed. Finally, 
the device deletes the remapping entry when receiving the 
response of a migration callback. 


For migrated metadata blocks, the file system does not 
need to be informed of the physical address change since 
it is kept in the virtual address space. Thus, the device 
does not keep remapping entries or send migration call- 
backs for metadata blocks. 


During the migration callback process, we allow reads 
and overwrites to the migrated data blocks. When receiv- 
ing a read or an overwrite during the callback period, the 
device first looks in the migration mapping table to locate 
the current physical address of the data block and then 
performs the request. 


Since all remapping entries are stored in the on-device 
RAM before the file system finishes processing the mi- 
gration callbacks, we may run out of RAM space if the 
file system does not respond to callbacks or responds 
too slowly. In such a case, we simply prohibit future 
wear-leveling migrations and prevent block wear-out only 
through garbage collection. 
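The callback protocol described in this section can be sketched as a small remapping table on the device plus a handler on the file-system side. The names are hypothetical, the associated metadata is reduced to an (inode number, offset) pair, and the sketch assumes a block is migrated at most once before its callback is acknowledged. Entries are keyed by both the old address and the metadata, since the old address may be reused by an unrelated block.

```python
from typing import Dict, List, Tuple

Owner = Tuple[int, int]            # (inode number, block offset); simplified metadata
Callback = Tuple[int, int, Owner]  # (old address, new address, owner)


class MigrationTable:
    """Remapping entries held in device RAM until the file system responds."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: Dict[Tuple[int, Owner], int] = {}

    def can_migrate(self) -> bool:
        """False once RAM is exhausted; wear-leveling migration must then pause."""
        return len(self.entries) < self.capacity

    def record(self, old: int, new: int, owner: Owner) -> Callback:
        if not self.can_migrate():
            raise RuntimeError("remapping table full; migration is suspended")
        self.entries[(old, owner)] = new
        return (old, new, owner)

    def resolve(self, addr: int, owner: Owner) -> int:
        """Translate a possibly stale address sent with a read or overwrite."""
        return self.entries.get((addr, owner), addr)

    def acknowledge(self, old: int, owner: Owner) -> None:
        """Drop the entry once the file system reports the callback processed."""
        self.entries.pop((old, owner), None)


def process_callback(block_ptrs: Dict[Owner, int], batch: List[Callback],
                     table: MigrationTable) -> None:
    """File-system side: repoint metadata, then respond to the device."""
    for old, new, owner in batch:
        if block_ptrs.get(owner) == old:
            block_ptrs[owner] = new
        # The write of the changed metadata block is elided in this sketch.
        table.acknowledge(old, owner)
```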
5.7 Reliability Discussion 

The changes to the ext3 file system discussed above may 
cause new reliability issues. In this section, we discuss 
several reliability issues and our solutions to them. 

There are three main reliability issues related to name- 
less writes. First, we maintain a mapping table in the 
on-device RAM for the virtual address space. This table 
needs to be reconstructed each time the device powers on 
(either after a normal power-off or a crash). Second, the 
in-memory metadata can be inconsistent with the physical 
addresses of nameless blocks because of a crash after writ- 
ing a data block and before updating its metadata block, 
or because of a crash during wear-leveling callbacks. Fi- 
nally, crashes can happen during in-place garbage collec- 
tion, specifically, after reading the live data and before 
writing them back, which may cause data loss. 

We solve the first two problems by using the meta- 
data information maintained in the device OOB area. We 
store logical addresses with data pages in the virtual ad- 
dress space for reconstructing the logical-to-physical ad- 
dress mapping table. We store associated metadata, as 
discussed in Section 3.4, with all nameless data. We also 
store the validity of all flash pages in their OOB area. We 
maintain an invariant that metadata in the OOB area is al- 
ways consistent with the data in the flash page by writing 
the OOB area and the flash page atomically. 

We solve the in-place garbage collection reliability 
problem by requiring the use of a small memory backed 
by battery or super-capacitor. Notice that the amount of 
live data we need to hold during a garbage collection op- 
eration is no more than the size of an SSD block, typically 
256 KB, thus only adding a small monetary cost to the 
whole device. 

The recovery process works as follows. When the de- 
vice is started, we perform a whole-device scan and read 
the OOB area of all valid flash pages to reconstruct the 
mapping table of the virtual address space. If a crash is de- 
tected, we perform the following steps. The device sends 
the associated metadata in the OOB area and the physical 
addresses of flash pages in the physical address space to 
the file system. The file system then locates the proper 
metadata structures. If the physical address in a metadata 
structure is inconsistent, the file system updates it with 
the new physical address and adds the metadata write to a 
dedicated transaction. After all metadata is processed, the 
file system commits the transaction, at which point the re- 
covery process is finished. 
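The two recovery phases can be outlined as follows. This is a sketch under stated assumptions: `FlashPage` is a hypothetical view of a page's OOB contents, the associated metadata is reduced to an (inode number, offset) pair, and when two valid pages claim the same owner the sketch keeps the one with the newer timestamp, a tie-breaking rule the text does not spell out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

Owner = Tuple[int, int]  # (inode number, block offset); simplified metadata


@dataclass
class FlashPage:
    phys_addr: int
    valid: bool
    logical_addr: Optional[int] = None  # set for pages in the virtual address space
    owner: Optional[Owner] = None       # set for nameless data pages
    timestamp: int = 0


def rebuild_mapping_table(pages: Iterable[FlashPage]) -> Dict[int, int]:
    """Run at every power-on: logical -> physical map for the virtual space."""
    return {p.logical_addr: p.phys_addr
            for p in pages if p.valid and p.logical_addr is not None}


def repair_pointers(pages: Iterable[FlashPage],
                    pointers: Dict[Owner, int]) -> Dict[Owner, int]:
    """Run after a crash: return the pointer updates in the recovery transaction."""
    newest: Dict[Owner, FlashPage] = {}
    for p in pages:
        if p.valid and p.owner is not None:
            seen = newest.get(p.owner)
            if seen is None or p.timestamp > seen.timestamp:
                newest[p.owner] = p
    transaction = {owner: page.phys_addr for owner, page in newest.items()
                   if pointers.get(owner) != page.phys_addr}
    pointers.update(transaction)  # stands in for committing the transaction
    return transaction
```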


6 Evaluation 

In this section, we present our evaluation of nameless 
writes on an emulated nameless-writing device. Specif- 
ically, we focus on studying the following questions: 


• What are the space costs of nameless-writing devices 
compared to other FTLs? 
Configuration | Value 
SSD Size | 4 GB 
Page Size | 4 KB 
Block Size | 256 KB 
Number of Planes | 10 
Hybrid Log Block Area | 5% 
Page Read Latency | 25 µs 
Page Write Latency | 200 µs 
Block Erase Latency | 1500 µs 


Table 2: SSD Emulator Configuration. 


• What is the overall performance benefit of nameless- 
writing devices? 


• What is the write performance of nameless-writing 
devices? How and why is it different from page-level 
mapping and hybrid mapping FTLs? 


• What is the cost of in-place garbage collection and 
the overhead of wear-leveling callbacks? 


• Is crash recovery correct and what are its overheads? 


SSD Emulator: We built an SSD emulator which mod- 
els a multi-plane SSD with garbage collection and wear 
leveling as a pseudo block device based on David [4]. We 
implemented three types of FTLs: page-level mapping, 
hybrid mapping, and nameless-writing on top of the PSU 
object-oriented SSD simulator codebase [6]. Data is 
stored in memory to enable quick and accurate emulation. 
Table 2 describes the configuration we used. 

The page-level mapping FTL writes data in a log- 
structured fashion and schedules in round-robin order 
across parallel planes. It keeps a mapping for each data 
page between its logical and physical address. We assume 
(unrealistically) that this SSD has enough memory to store 
all page-level mappings. The page-level SSD serves as an 
upper-bound on performance. 

We implemented a hybrid mapping FTL similar to 
FAST [22], which uses a log block area for random data 
and one sequential log block dedicated for sequential 
streams. The rest of the device is a data block area used to 
store whole data blocks. The hybrid mapping FTL main- 
tains the page-level mapping of the log block area and the 
block-level mapping of the data block area. 

We implemented a simple garbage collection algorithm 
that recycles blocks with the least live data in page-level 
mapping and hybrid mapping FTLs, and a wear-leveling 
algorithm on all three FTLs that considers a block’s re- 
maining erase cycles and its data temperature during wear 
leveling similar to a previous wear-leveling algorithm [2]. 


System Setup: We implemented the emulated 
nameless-write device and the nameless-writing ext3 file 
system on a 64-bit Linux 2.6.33 kernel. The page-level 





Image Size | Page | Hybrid | Nameless 
328 MB | 328 KB | 38 KB | 2.7 KB 
2 GB | 2 MB | 235 KB | 12 KB 
10 GB | 10 MB | 1.1 MB | 31 KB 
100 GB | 100 MB | 11 MB | 251 KB 
400 GB | 400 MB | 46 MB | 1 MB 
1 TB | 1 GB | 118 MB | 2.2 MB 


Table 3: FTL Mapping Table Size. Mapping table size of 
page-level, hybrid, and nameless-writing devices with different 
file system images. The configuration in Table 2 is used. 


mapping and the hybrid mapping SSD emulators are 
built on an unmodified 64-bit Linux 2.6.33 kernel. All 
experiments are performed on a 2.5 GHz Intel Quad Core 
CPU with 8 GB memory. 


6.1 SSD Memory Consumption 
We first study the space cost of mapping tables used by 
different SSD FTLs: nameless-writing, page-level map- 
ping, and hybrid mapping. The mapping table size of 
page-level and hybrid FTLs is calculated based on the to- 
tal size of the device, its block size, and its log block area 
size (for hybrid mapping). A nameless-writing device 
keeps a mapping table for the entire file system’s virtual 
address space. Since we map all metadata to the virtual 
block space in our nameless-writing implementation, the 
mapping table size of the nameless-writing device is de- 
pendent on the metadata size of the file system image. We 
use Impressions [3] to create typical file system images of 
sizes up to 1 TB and calculate their metadata sizes. 
Table 3 shows the mapping table sizes of the three 
FTLs with different file system images produced by Im- 
pressions. Unsurprisingly, the page-level mapping has the 
highest mapping table space cost. The hybrid mapping 
has a moderate space cost; however, its mapping table size 
is still quite large: over 100 MB for a 1-TB device. The 
nameless mapping table has the lowest space cost; even 
for a 1-TB device, its mapping table uses less than 3 MB 
of space for typical file systems, reducing both cost and 
power usage. 
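The page-level column of Table 3 follows from simple arithmetic: one mapping entry per 4-KB page. The sketch below reproduces it under the assumption of 4-byte entries, a value inferred from the table (1 TB of 4-KB pages yielding a 1-GB table) rather than stated in the text; the function name is hypothetical.

```python
PAGE_SIZE = 4 * 1024  # bytes, from the emulator configuration
ENTRY_BYTES = 4       # assumption inferred from Table 3, not stated in the text


def page_level_table_bytes(device_bytes: int,
                           page_size: int = PAGE_SIZE,
                           entry_bytes: int = ENTRY_BYTES) -> int:
    """Size of a page-level mapping table: one entry per flash page."""
    return (device_bytes // page_size) * entry_bytes


KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40

for size in (328 * MB, 2 * GB, 10 * GB, 100 * GB, 400 * GB, 1 * TB):
    print(f"{size / GB:8.2f} GB device -> {page_level_table_bytes(size) / MB:8.2f} MB table")
```

With these assumptions the table grows linearly at about 1 MB per GB of device, whereas the nameless-writing table depends only on the amount of file-system metadata.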


6.2 Application Performance 
We now present the overall application performance of 
nameless-writing, page-level mapping and hybrid map- 
ping FTLs with macro-benchmarks. We use varmail, file- 
server, and webserver from the filebench suite [29]. 
Figure 2 shows the throughput of these benchmarks. 
We see that both page-level mapping and nameless- 
writing FTLs perform better than the hybrid mapping FTL 
with varmail and fileserver. These benchmarks contain 
90.8% and 70.6% random writes, respectively. As we 
will see later in this section, the hybrid mapping FTL per- 
forms well with sequential writes and poorly with random 
writes. Thus, its throughput for these two benchmarks 
Figure 2: Throughput of Filebench. Throughput of varmail, 
fileserver, and webserver macro-benchmarks with page-level, 
nameless-writing, and hybrid FTLs. 


is worse than the other two FTLs. For webserver, all 
three FTLs deliver similar performance, since it contains 
only 3.8% random writes. We see a small overhead of 
the nameless-writing FTL as compared to the page-level 
mapping FTL with all benchmarks, which we will discuss 
in detail in Sections 6.5 and 6.6. 

In summary, we demonstrate that the nameless-writing 
device achieves excellent performance, roughly on par 
with the costly page-level approach, which serves as an 
upper-bound on performance. 


6.3 Basic Write Performance 

Write performance of flash-based SSDs is known to be 
much worse than read performance, with random writes 
being the performance bottleneck. Nameless writes aim to 
improve write performance of such devices by giving the 
device more data-placement freedom. We evaluate the ba- 
sic write performance of our emulated nameless-writing 
device in this section. Figure 3 shows the throughput 
of sequential writes and sustained 4-KB random writes 
with page-level mapping, hybrid mapping, and nameless- 
writing FTLs. 

First, we find that the emulated hybrid-mapping de- 
vice has a sequential throughput of 169 MB/s and a sus- 
tained 4-KB random write throughput of 2,830 IOPS. A 
widely used real mid-range SSD has sequential throughput 
of up to 70 MB/s and random throughput of up to 
3,300 IOPS [17]. Although the write performance of our 
emulator does not match this real SSD exactly, it is still in 
the ballpark of actual SSD performance, and thus useful 
in our study. The goal of our hybrid-mapping emulator is 
not to model one particular SSD perfectly but to provide 
insight into the fundamental problems of hybrid-mapped 
SSDs as compared to page-mapped and nameless SSDs. 

Second, the random write throughput of page-level 
mapping and nameless-writing FTLs is close to their se- 
quential write throughput, because both FTLs allocate 
data in a log-structured fashion, making random writes 
behave like sequential writes. The overhead of random 
Figure 3: Sequential and Random Write Throughput. 
Throughput of sequential writes and sustained 4-KB random 
writes. Random writes are performed over a 2-GB range. 


writes with these two FTLs comes from their garbage col- 
lection process. Since whole blocks can be erased when 
they are overwritten in sequential order, garbage collec- 
tion has the lowest cost with sequential writes. By con- 
trast, garbage collection of random data may incur the cost 
of live data migration. 

Third, we notice that the random write throughput of 
the hybrid mapping FTL is significantly lower than that of 
the other FTLs and its own sequential write throughput. 
The poor random write performance of the hybrid map- 
ping FTL results from the costly full-merge operation and 
its corresponding garbage collection process [16]. Full 
merges are required each time a log block is filled with 
random writes and are thus a dominating cost for random writes. 

One way to improve the random write performance of 
hybrid-mapped SSDs is to over-provision more log block 
space. To explore that, we vary the size of the log block 
area with the hybrid mapping FTL from 5% to 20% of 
the whole device and find that random write throughput 
increases with the size of the log block area. 
However, only the data block area reflects the effective 
size of the device, while the log block area is part of de- 
vice over-provisioning. Therefore, hybrid-mapped SSDs 
often sacrifice device space cost for better random write 
performance. Moreover, the hybrid mapping table size in- 
creases with higher log block space, requiring larger on- 
device RAM. Nameless writes achieve significantly bet- 
ter random write performance with no additional over- 
provisioning or RAM space. 

Finally, Figure 3 shows that the nameless-writing FTL 
has low overhead as compared to the page-level mapping 
FTL with sequential and random writes. We explain this 
result in more detail in Sections 6.5 and 6.6. 


6.4 A Closer Look at Random Writes 

A previous study [16] and our study in the last section 
show that random writes are the major performance bot- 
tleneck of flash-based devices. We now study two subtle 
yet fundamental questions: do nameless-writing devices 
Figure 7: Page-Level FTL Utilization. Breakdown of device 
utilization with the page-level FTL under random 
writes of different ranges. 


perform well with different kinds of random-write workloads, 
and why do they outperform hybrid devices? 


To answer the first question, we study the effect of 
working set size on random writes. We create files of dif- 
ferent sizes and perform sustained 4-KB random writes in 
each file to model different working set sizes. Figure 4 
shows the throughput of random writes over different file 
sizes with all three FTLs. We find that the working set 
size has a large effect on random write performance of 
nameless-writing and page-level mapping FTLs. The random 
write throughput of these FTLs drops as the working 
set size increases. When random writes are performed 
over a small working set, they will be overwritten in full 
when the device fills and garbage collection is triggered. 
In such cases, there is a higher chance of finding blocks 
that are filled with invalid data and can be erased with no 
need to rewrite live data, thus lowering the cost of garbage 
collection. In contrast, when random writes are performed 
over a large working set, garbage collection has a higher 
cost since blocks contain more live data, which must be 
rewritten before erasing a block. 


To further understand the increasing cost of random 
writes as the working set increases, we plot the total 


Figure 8: Nameless FTL Utilization. Breakdown of device 
utilization with the nameless FTL under random 
writes of different ranges. 

Figure 6: Average Response Time of Synchronous Random 
Writes. 4-KB random writes in a 2-GB file. Sync frequency 
represents the number of writes we issue before calling an fsync. 

Figure 9: Hybrid FTL Utilization. Breakdown of device 
utilization with the hybrid FTL under random writes of 
different ranges. 


amount of live data migrated during garbage collection 
(Figure 5) of random writes over different working set 
sizes with all three FTLs. This graph shows that as the 
working set size of random writes increases, more live 
data is migrated during garbage collection for these FTLs, 
resulting in a higher garbage collection cost and worse 
random write performance. 


Comparing the page-level mapping FTL and the 
nameless-writing FTL, we find that nameless-writing has 
slightly higher overhead when the working set size is high. 
This overhead is due to the cost of in-place garbage col- 
lection when there is wasted space in the recycled block. 
We will study this overhead in detail in the next section. 


We now study the second question to further understand 
the cost of random writes with different FTLs. We break 
down the device utilization into regular writes, block 
erases, writes during merging, reads during merging, and 
device idle time. Figures 7, 8, and 9 show the stack plot of 
these costs over all three FTLs. For page-level mapping 
and nameless-writing FTLs, we see that the major cost 
comes from regular writes when random writes are per- 
formed over a small working set. When the working set 
increases, the cost of merge writes and erases increases 
and becomes the major cost. For the hybrid mapping 
FTL, the major cost of random writes comes from migrat- 
ing live data and idle time during merging for all work- 
ing set sizes. When the hybrid mapping FTL performs a 
full merge, it reads and writes pages from different planes, 
thus creating idle time on each plane. 

In summary, we demonstrate that random write 
throughput of the nameless-writing FTL is close to that 
of the page-level mapping FTL and is significantly bet- 
ter than the hybrid mapping FTL, mainly because of the 
costly merges the hybrid mapping FTL performs for random 
writes. We also find that both nameless-writing 
and page-level mapping FTLs achieve better random write 
throughput when the working set is relatively small be- 
cause of a lower garbage collection cost. 


6.5 In-place Garbage Collection Overhead 


The performance overhead of a nameless-writing de- 
vice may come from two different device responsibili- 
ties: garbage collection and wear leveling. We study the 
overhead of in-place garbage collection in this section and 
wear-leveling overhead in the next section. 

Our implementation of the nameless-writing device 
uses an in-place merge to perform garbage collection. As 
explained in Section 4.2, when there are no waiting writes 
on the device, we may waste the space that has been re- 
cently garbage collected. We use synchronous random 
writes to study this overhead. We vary the frequency 
of calling fsync to control the amount of waiting writes 
on the device; when the sync frequency is high, there 
are fewer waiting writes on the device queue. Figure 6 
shows the average response time of 4-KB random writes 
with different sync frequencies under page-level mapping, 
nameless-writing, and hybrid mapping FTLs. We find that 
when sync frequency is high, the nameless-writing device 
has a larger overhead compared to page-level mapping. 
This overhead is due to the lack of waiting writes on the 
device to fill garbage-collected space. However, we see 
that the average response time of the nameless-writing 
FTL is still lower than that of the hybrid mapping FTL, 
since response time is worse when the hybrid FTL per- 
forms full-merge with synchronous random writes. 
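A workload of this shape can be approximated with a short script. The sketch below is not the benchmark used in the study: the function name and parameters are hypothetical, it relies on POSIX `os.pwrite`, and it follows the Figure 6 convention that sync frequency is the number of writes issued before each fsync.

```python
import os
import random
import time


def sync_random_writes(path: str, file_size: int, n_writes: int,
                       writes_per_fsync: int, block: int = 4096,
                       seed: int = 0) -> float:
    """Issue block-aligned random writes; return mean seconds per write."""
    rng = random.Random(seed)
    payload = b"\0" * block
    n_blocks = file_size // block
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, file_size)          # size the file before writing
        start = time.perf_counter()
        for i in range(1, n_writes + 1):
            os.pwrite(fd, payload, rng.randrange(n_blocks) * block)
            if i % writes_per_fsync == 0:
                os.fsync(fd)                 # force queued writes to the device
        os.fsync(fd)                         # flush any remainder
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / n_writes
```

Sweeping `writes_per_fsync` over a range of values, with `file_size` set to 2 GB, would mirror the sweep on the x-axis of Figure 6.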


6.6 Wear-leveling Callback Overhead 


Finally, we study the overhead of wear leveling in a 
nameless-writing device. To perform wear-leveling exper- 
iments, we reduce the lifetime of SSD blocks to 50 erase 
cycles. We set the threshold of triggering wear leveling to 
be 75% of the maximal block lifetime, and set blocks that 
are under 90% of the average remaining block lifetime to be 
candidates for wear leveling. 

We create two workloads to model different data tem- 
perature and SSD wear: a workload that first writes 3.5- 
GB data in sequential order and then overwrites the first 
500-MB area 40 times (Workload 1), and a workload that 
overwrites the first 1-GB area 40 times (Workload 2). 
Workload 2 has more hot data and triggers more wear 
leveling. We compare the throughput of these workloads 
with page-level mapping and nameless-writing FTLs in 
Figure 10. The throughput of Workload 2 is worse than 
that of Workload 1 because of its more frequent wear- 
leveling operation. Nonetheless, the performance of the 
nameless-writing FTL with both workloads has less than 
9% overhead. 

We then plot the amount of migrated live data during 
wear leveling with both FTLs in Figure 11. As expected, 
Workload 2 produces more wear-leveling migration traf- 
fic. Comparing page-level mapping to nameless-writing 
FTLs, we find that the nameless-writing FTL migrates 
more live data. When the nameless-writing FTL performs 
in-place garbage collection, it generates more migrated 
live data, as shown in Figure 5. Therefore, more erases are 
caused by garbage collection with the nameless-writing 
FTL, resulting in more wear-leveling invocation and more 
wear-leveling migration traffic. 

Migrating live nameless data in a nameless-writing 
device creates callback traffic and additional metadata 
writes. Wear leveling in a nameless-writing device also 
adds a space overhead when it stores the remapping ta- 
ble for migrated data. We show the amount of additional 
metadata writes and the maximal size of the remapping 
table of a nameless-writing device in Figure 12. We find 
both overheads to be low with the nameless-writing device: 
less than 6 MB of additional metadata writes and a 
space cost of less than 350 KB. 

In summary, we find that both the garbage-collection 
and wear-leveling overheads caused by nameless writes 
are low. Since wear leveling is not a frequent operation 
and is often scheduled in system idle periods, we expect 
both performance and space overheads of a nameless- 
writing device to be even lower in real systems. 


6.7 Reliability 


To determine the correctness of our reliability solution, 
we inject crashes at the following points: 1) after writing 
a data block and its metadata block, 2) after writing 
a data block and before updating its metadata block, 3) 
after writing a data block and updating its metadata block 
but before committing the metadata block, and 4) after the 
device migrates a data block because of wear leveling and 
before the file system processes the migration callback. In 
all cases, we successfully recover the system to a consis- 
tent state that correctly reflects all written data blocks and 
their metadata. 

Our results also show that the overhead of our crash re- 
covery process is relatively small: from 0.4 to 6 seconds, 
depending on the amount of inconsistent metadata after a 
crash. With more inconsistent metadata, the overhead of 
recovery is higher. 


7 Related Work 


A large body of work on flash-based SSD FTLs and file 
systems that manage them has been proposed in recent 
years [11, 14, 16, 19, 21, 22, 25, 33]. In this section, we 
discuss the two research projects that are most related to 
nameless writes. 

Range writes [5] use an approach similar to nameless 
writes. Range writes were proposed to improve hard disk 
performance by letting the file system specify a range of 
addresses and letting the device pick the final physical 
address of a write. Nameless writes, by contrast, 
specify no address at all, thus obviating 
file system allocation and moving allocation responsibil- 
ity to the device. Problems such as updating metadata af- 
ter writes in range writes also arise in nameless writes. We 
propose a segmented address space to lessen the overhead 
and the complexity of such an update process. Another 
difference is that nameless writes target devices that need 
to maintain control of data placement, such as wear level-
ing in flash-based devices. Range writes target traditional
hard disks that do not have such responsibilities. Data
placement with flash-based devices is also less restricted 
than traditional hard disks, since flash-based memory has 
uniform access latency regardless of its location. 
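
The difference between the two interfaces can be illustrated with a toy model. The sketch below is not the implementation evaluated in this paper: the class `ToyDevice`, its methods `range_write` and `nameless_write`, and the lowest-free-address placement policy are all hypothetical and chosen only for brevity.

```python
class ToyDevice:
    """Toy block device; every name here is an illustrative assumption."""

    def __init__(self, num_blocks):
        self.blocks = {}                      # physical address -> data
        self.free = set(range(num_blocks))    # unallocated physical addresses

    def _place(self, candidates, data):
        # Assumed policy: pick the lowest free address among the candidates.
        addr = min(candidates)
        self.free.remove(addr)
        self.blocks[addr] = data
        return addr

    def range_write(self, lo, hi, data):
        """Range write: the file system supplies [lo, hi); the device picks within it."""
        candidates = {a for a in self.free if lo <= a < hi}
        if not candidates:
            raise IOError("no free block in requested range")
        return self._place(candidates, data)

    def nameless_write(self, data):
        """Nameless write: no address is supplied; the device returns the one it chose."""
        if not self.free:
            raise IOError("device full")
        return self._place(self.free, data)


dev = ToyDevice(num_blocks=8)
a = dev.range_write(4, 8, b"ranged")     # constrained to addresses 4..7
b = dev.nameless_write(b"nameless")      # device may choose any free address
# In both cases the file system must record the returned address in its metadata.
```

In either interface the caller learns the final address only from the return value, which is why both approaches must update metadata after the write completes.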

The poor random write performance of hybrid FTLs 
has drawn attention from researchers in recent years. The 
demand-based Flash Translation Layer (DFTL) was pro- 
posed to address this problem by maintaining a page-level 
mapping table and writing data in a log-structured fashion 
[16]. DFTL stores its page-level mapping table on the de- 
vice and keeps a small portion of the mapping table in the 
device cache based on workload temporal locality. How- 
ever, for workloads that have a bigger working set than the 
device cache, swapping the cached mapping table with the 
on-device mapping table structure can be costly. There is 
also a space overhead to store the entire page-level map- 
ping table on device. We use a log-structured write order 
similar to DFTL to maximize the device’s sequential writ- 
ing capability. However, the need for a device-level map- 
ping table is obviated with nameless writes. Indirection 
is maintained only for the virtual address space, which as 
we show, requires a small space cost and can fit in the de- 
vice cache with typical file system images. Thus, we do 
not pay the space cost of storing the large page-level map- 
ping table in the device or the performance overhead of 
swapping mapping table entries. 


8 Conclusions and Future Work


In this paper, we introduced nameless writes, a new write 
interface built to reduce the inherent costs of indirection. 
Through the implementation of nameless writes on the 
Linux ext3 file system and an emulated nameless-writing 
device, we demonstrate how to port a file system to use 
nameless writes. Through extensive evaluations, we show 
the great advantage of nameless writes: greatly reduced 
space costs and improved random-write performance. 

Porting other types of file systems to use nameless 
writes would be interesting and is a part of our future 
work. Here, we give a brief discussion about these file 
systems and the challenges we foresee in changing them 
to use nameless writes. 


Linux ext2: The Linux ext2 file system is similar to the 
ext3 file system except that it has no journaling. While we 
rely on the ordered journal mode to provide a natural or- 
dering for the metadata update process of nameless writes 
in ext3, we need to introduce an ordering on the ext2 file 
system. Our initial implementation of nameless-writing 
ext2 shows that one possible method to enforce such an 
ordering is to defer metadata writes until all the ongoing 
data writes belonging to them have finished. 
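
One way to picture such a deferral is a per-metadata-block count of outstanding data writes. The following is a minimal sketch under that assumption, not our ext2 implementation; `DeferredMetadataWriter` and its method names are hypothetical.

```python
from collections import defaultdict


class DeferredMetadataWriter:
    """Toy ordering layer; class and method names are hypothetical."""

    def __init__(self):
        self.pending = defaultdict(int)   # metadata block id -> data writes in flight
        self.dirty = set()                # metadata blocks waiting to be written
        self.issued = []                  # order in which metadata writes were issued

    def start_data_write(self, meta_id):
        self.pending[meta_id] += 1

    def request_metadata_write(self, meta_id):
        # Defer if any data write this metadata block points to is still in flight.
        self.dirty.add(meta_id)
        self._flush_if_ready(meta_id)

    def complete_data_write(self, meta_id):
        self.pending[meta_id] -= 1
        self._flush_if_ready(meta_id)

    def _flush_if_ready(self, meta_id):
        if meta_id in self.dirty and self.pending[meta_id] == 0:
            self.dirty.remove(meta_id)
            self.issued.append(meta_id)


w = DeferredMetadataWriter()
w.start_data_write("inode-7")
w.start_data_write("inode-7")
w.request_metadata_write("inode-7")   # deferred: two data writes outstanding
w.complete_data_write("inode-7")
still_deferred = (w.issued == [])
w.complete_data_write("inode-7")      # last data write done, so metadata is issued
```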


Copy-On-Write File Systems and Snapshots: As an 
alternative to journaling, copy-on-write (COW) file sys- 
tems always write out updates to new free space; when all 
of those updates have reached the disk, a root structure is
updated to point at the new structures, and thus include 
them in the state of the file system. COW file systems 
thus map naturally to nameless writes. All writes to free
space are mapped into the physical segment and issued
namelessly; the root structure is mapped into the virtual 
segment. The write ordering is not affected, as COW file 
systems all must wait for the COW writes to complete be- 
fore issuing a write to the root structure anyway. 

One problem with COW file systems or other file sys- 
tems that support snapshots or versions is that multiple 
metadata structures can point to the same data block, 
which may result in a large amount of associated meta- 
data. We can use file system intrinsic back references, 
such as those in btrfs, or structures like Backlog [23] to 
represent associated metadata. Another problem is that 
multiple metadata blocks need to be updated after a name- 
less write. One possible way to control the number of 
metadata updates is to reduce the amount of metadata in- 
cluded in the virtual address space. 


Extent-Based File Systems: One final type of file sys-
tem worth considering is the extent-based file system,
such as Linux btrfs and ext4, where contiguous regions 
of a file are pointed to via (pointer, length) pairs instead 
of a single pointer per fixed-sized block. Modifying an 
extent-based file system to use nameless writes would re- 
quire a bit of work; as nameless writes of data are issued, 
the file system would not (yet) know if the data blocks will 
form one extent or many. Thus, only when the writes com- 
plete will the file system be able to determine the outcome. 
Later writes would not likely be located nearby, and thus 
to minimize the number of extents, updates should be is-
sued at a single time. Extents also hint at the possibility of
a new interface for nameless writes. Specifically it might 
be useful to provide an interface to reserve a larger con- 
tiguous region on the device; doing so would enable the 
file system to ensure that a large file was placed contigu-
ously in physical space, and thus afford a highly compact
extent-based representation. We plan to look into such en- 
hancements in the future. 


Abstract


In recent years, flash-based SSDs have grown enor- 
mously both in capacity and popularity. In high- 
performance enterprise storage applications, accelerating 
adoption of SSDs is predicated on the ability of manu- 
facturers to deliver performance that far exceeds disks 
while closing the gap in cost per gigabyte. However, 
while flash density continues to improve, other metrics 
such as reliability, endurance, and performance are all
declining. As a result, building larger-capacity flash- 
based SSDs that are reliable enough to be useful in en- 
terprise settings and high-performance enough to justify 
their cost will become challenging. 

In this work, we present our empirical data collected 
from 45 flash chips from 6 manufacturers and examine 
the performance trends for these raw flash devices as 
flash scales down in feature size. We use this analysis to 
predict the performance and cost characteristics of future 
SSDs. We show that future gains in density will come 
at significant drops in performance and reliability. As 
a result, SSD manufacturers and users will face a tough 
choice in trading off between cost, performance, capacity 
and reliability. 


1 Introduction 


Flash-based Solid State Drives (SSDs) have enabled a 
revolution in mobile computing and are making deep in- 
roads into data centers and high-performance computing. 
SSDs offer substantial performance improvements rela- 
tive to disk, but cost is limiting adoption in cost-sensitive 
applications and reliability is limiting adoption in higher-
end machines. The hope of SSD manufacturers is that im-
provements in flash density through silicon feature size
scaling (shrinking the size of a transistor) and storing 
more bits per storage cell will drive down costs and in- 
crease their adoption. Unfortunately, trends in flash tech- 
nology suggest that this is unlikely. 


While flash density in terms of bits/mm² and feature
size scaling continues to increase rapidly, all other fig- 
ures of merit for flash — performance, program/erase en- 
durance, energy efficiency, and data retention time — de- 
cline steeply as density rises. For example, our data show 
each additional bit per cell increases write latency by 4x 
and reduces program/erase lifetime by 10x to 20x (as 
shown in Figure 1), while providing decreasing returns 
in density (2x, 1.5x, and 1.3x between 1-,2-,3- and 4- 
bit cells, respectively). As a result, we are reaching the 
limit of what current flash management techniques can 
deliver in terms of usable capacity — we may be able to 
build more spacious SSDs, but they may be too slow and 
unreliable to be competitive against disks of similar cost 
in enterprise applications. 
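
The diminishing density returns quoted above follow from simple ratios, as the short calculation below shows. It assumes, as a simplification, that density at a fixed feature size is proportional to the number of bits stored per cell.

```python
# Relative density gain from adding one more bit to each cell,
# assuming density is proportional to bits per cell.
bits_per_cell = [1, 2, 3, 4]
gains = [after / before for before, after in zip(bits_per_cell, bits_per_cell[1:])]
for before, after, gain in zip(bits_per_cell, bits_per_cell[1:], gains):
    print(f"{before} -> {after} bits per cell: {gain:.2f}x density")
```

Each added bit therefore buys proportionally less capacity, while the latency and lifetime penalties cited above apply in full at every step.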


This paper uses empirical data from 45 flash chips 
manufactured by six different companies to identify 
trends in flash technology scaling. We then use those 
trends to make projections about the performance and 
cost of future SSDs. We construct an idealized SSD 
model that makes optimistic assumptions about the effi- 
ciency of the flash translation layer (FTL) and shows that 
as flash continues to scale, it will be extremely difficult 
to design SSDs that reduce cost per bit without becoming 
either too slow or too unreliable (or both) as to be unus- 
able in enterprise settings. We conclude that the cost per 
bit for enterprise-class SSDs targeting general-purpose 
applications will stagnate. 


The rest of this paper is organized as follows. Sec- 
tion 2 outlines the current state of flash technology. Sec- 
tion 3 describes the architecture of our idealized SSD de- 
sign, and how we combine it with our measurements to 
project the behavior of future SSDs. Section 4 presents 
the results of this idealized model, and Section 5 con- 
cludes. 
Figure 1: Trends in Flash’s Reliability Increasing flash’s density by adding bits to a cell or by decreasing feature size
reduces both (a) lifetime and (b) reliability.


2 The State of NAND Flash Memory


Flash-based SSDs are evolving rapidly and in complex 
ways — while manufacturers drive toward higher densi- 
ties to compete with HDDs, increasing density by using 
newer, cutting edge flash chips can adversely affect per- 
formance, energy efficiency and reliability. 

To enable higher densities, manufacturers scale down 
the manufacturing feature size of these chips while also 
leveraging the technology’s ability to store multiple bits 
in each cell. Most recently on the market are 25 nm 
cells which can store three bits each (called Triple Level 
Cells, or TLC). Before TLC came 2-bit, multi-level cells 
(MLC) and 1-bit single-level cells (SLC). Techniques 
that enable four or more bits per cell are on the hori- 
zon [12]. 

Figure 2 collects the trend in price of raw flash mem-
ory from a variety of industrial sources, and shows the 
drop in price per bit for the higher density chips. Histor- 
ically, flash cost per bit has dropped by between 40 and 
50% per year [3]. However, over the course of 2011, the 
price of flash flattened out. If flash has trouble scaling 
beyond 12nm (as some predict), the prospects for further 
cost reductions are uncertain. 

The limitations of MLC and TLC’s reliability and per- 
formance arise from their underlying structures. Each 
flash cell comprises a single transistor with an added 
layer of metal between the gate and the channel, called 
the floating gate. To change the value stored in the cell, 
the program operation applies very high voltages to its 
terminals which cause electrons to tunnel through the 
gate oxide to reach the floating gate. To erase a cell, 
the voltages are reversed, pulling the electrons off the 
floating gate. Each of these operations strains the gate 
oxide, until eventually it no longer isolates the floating 
gate, making it impossible to store charge. 

Figure 2: Trends in Flash Prices Flash prices reflect the
target markets. Low density, SLC, parts target higher-
priced markets which require more reliability while high
density MLC and TLC are racing to compete with low-
cost HDDs. Cameras, iPods and other mobile devices
drive the low end.

The charge on the floating gate modifies the threshold
voltage, V_TH, of the transistor (i.e., the voltage at which
the transistor turns on and off). In a programmed SLC
cell, V_TH will be in one of two ranges (since program-
ming is not perfectly precise), depending on the value the
cell stores. The two ranges have a “guard band” between 
them. Because the SLC cell only needs two ranges and 
a single guard band, both ranges and the guard band can 
be relatively wide. Increasing the number of bits stored 
from one (SLC) to two (MLC) increases the number of 
distributions from two to four, and requires two addi-
tional guard bands. As a result, the distributions must be
tighter and narrower. The necessity of narrow V_TH distri-
butions increases programming time, since the chip must
make more, finer adjustments to V_TH to program the cell
correctly (as described below). At the same time, the nar-
row guard band reduces reliability. TLC cells make this
problem even worse: They must accommodate eight V_TH
levels and seven guard bands.
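
The counts of voltage ranges and guard bands follow directly from the number of bits per cell. The helper names below are hypothetical; the sketch only restates the arithmetic in the paragraph above.

```python
def vth_levels(bits_per_cell):
    """Number of distinct threshold-voltage ranges needed to store the given bits."""
    return 2 ** bits_per_cell


def guard_bands(bits_per_cell):
    """Adjacent ranges are separated by one guard band each."""
    return vth_levels(bits_per_cell) - 1


table = {name: (vth_levels(b), guard_bands(b))
         for name, b in [("SLC", 1), ("MLC", 2), ("TLC", 3)]}
```

Because the usable voltage window does not grow with the number of bits, each doubling of the level count roughly halves the width available to every range and guard band.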


We present empirical evidence of worsening lifetime 
and reliability of flash as it reaches higher densities. We 
collected this data from 45 flash chips made by six man- 
ufacturers spanning feature sizes from 72 nm to 25 nm. 
Our flash characterization system (described in [4]) al- 
lows us to issue requests to a raw flash chip without 
FTL interference and measure the latency of each of 
these operations with 10 ns resolution. We repeat this 
program-erase cycle (P/E cycle) until each measured 
block reaches the rated lifetime of its chip.


Figure 1 shows the chips’ rated lifetime as well as the
bit error rate (BER) measured at that lifetime. The chips’ 
lifetimes decrease slowly with feature size, but fall pre- 
cipitously across SLC, MLC and TLC devices. While 
the error rates span a broad range, there is a clear upward 
trend as feature size shrinks and densities increase. Ap- 
plications that require more reliable or longer-term stor- 
age prefer SLC chips and those at larger feature sizes 
because they experience far fewer errors for many more 
cycles than denser technology. 


Theory and empirical evidence also indicate lower 
performance for denser chips, primarily for the program 
or write operation. Very early flash memory would apply 
a steady, high voltage to any cell being programmed for a
fixed amount of time. However, Suh et al. [10] quickly
determined that Incremental Step Pulse Programming
(ISPP) would be far more effective in tolerating variation
between cells and in environmental conditions. ISPP per- 
forms a series of program pulses each followed by a read- 
verify step. Once the cell is programmed correctly, pro- 
gramming for that cell stops. This algorithm is necessary 
because programming is a one-way operation: There is 
no way to “unprogram” a cell short of erasing the en- 
tire block, and overshooting the correct voltage results in 
storing the wrong value. ISPP remains a key algorithm in 
modern chips and is instrumental in improving the per- 
formance and reliability of higher-density cells. 
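
The pulse-and-verify loop described above can be sketched as follows. This is a simplified illustration, not a model of any particular chip: the function name `ispp_program`, the fixed step size, the arbitrary voltage units, and the `vth >= target` verify condition are all assumptions.

```python
def ispp_program(cell_vth, target_vth, step, max_pulses=100):
    """Raise cell_vth in fixed steps until a verify read sees it reach target_vth.

    Returns (final_vth, pulses_used). The fixed step size and the simple
    'vth >= target' verify are simplifying assumptions.
    """
    pulses = 0
    while cell_vth < target_vth:          # read-verify step
        if pulses == max_pulses:
            raise RuntimeError("cell failed to program")
        cell_vth += step                  # one program pulse; charge only increases
        pulses += 1
    return cell_vth, pulses


# Coarse pulses finish quickly but can overshoot the target by up to one step;
# fine pulses land closer to the target at the cost of more pulse/verify rounds.
coarse = ispp_program(0.0, 2.0, step=0.75)
fine = ispp_program(0.0, 2.0, step=0.25)
```

The trade-off between step size and pulse count is why tighter voltage distributions translate directly into longer program times.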


Not long after Samsung proposed MLC for NAND 
flash [5, 6], Toshiba split the two bits to separate pages so 
that the chip could program each page more quickly by 
moving the cell only halfway through the voltage range 
with each operation [11]. Much later, Samsung pro-
vided further performance improvements to pages stored
in the least significant bit of each cell [8] by applying
fast, imprecise pulses to program the fast pages and us-
ing fine-grain, precise pulses to program the slow pages.
These latter pulses generate the tight V_TH distributions
that MLC devices require, but they make programming 
much slower. All the MLC and TLC devices we tested 
split and program the bits in a cell this way. 


For SSD designers, this performance variability be- 
tween pages leads to an opportunity to easily trade off 
capacity and performance [4, 9]. The SSD can, for exam-
ple, use only the fast pages in MLC parts, sacrificing half
their capacity but making latency comparable to SLC. 
In this work, we label such a configuration “MLC-1” — 
an MLC device using just one bit per cell. Samsung and 
Micron have formalized this trade-off in multi-level flash 
by providing single and multi-level cell modes [7] in the 
same chip and we believe FusionIO uses the property in 
the controller of their SMLC-based drives [9]. 






Figure 3: Architecture of SSD-CDC The architecture of
our baseline SSD. This structure remains constant while
we scale the technology used for each flash die.

Channel Speed | 400 MB/s [1]
Dies per Channel (DPC) | 4
SSD Price | $7,800
Capacity | 320 GB

Table 1: Architecture and Baseline Configuration of
SSD-CDC These parameters define the Enterprise-class, 
Constant Die Count SSD (SSD-CDC) architecture and 
starting values for the flash technology it contains. 


3 A Prototypical SSD 


To model the effect of evolving flash characteristics on 
complete SSDs we combine empirical measurement of 
flash chips in an SSD architecture with a constant die 
count called SSD-CDC. SSD-CDC’s architecture is rep- 
resentative of high-end SSDs from companies such as 
FusionIO, OCZ and Virident. We model the complexi- 
ties of FTL design by assuming optimistic constants and 
overheads that provide upper bounds on the performance 
characteristics of SSDs built with future generation flash 
technology. 

Section 3.1 describes the architecture of SSD-CDC, 
while Section 3.2 describes how we combine this model 
with our empirical data to estimate the performance of 
an SSD with fixed die area. 


3.1 SSD-CDC 


Table 1 describes the parameters of SSD-CDC’s archi- 
tecture and Figure 3 shows a block representation of 
its architecture. SSD-CDC manages an array of flash 
chips and presents a block-based interface. Given current 
trends in PCIe interface performance, we assume that the
PCIe link is not a bottleneck for our design. 

Figure 4: Flash Chip Latency Trends Fitting an exponential to the collection of data for each cell technology, SLC-1,
MLC-1, MLC-2 and TLC-3, allows us to project the behavior of future feature sizes for (a) read latency and (b) write
latency. Doing the same with one standard deviation above and below the average for each chip yields a range of
probable behavior, as shown by the error bars.

Read Latency (µs) | Write Latency (µs)

Table 2: Latency Projections We generated these equations by fitting an exponential (y = Ae^{bf}) to our empirical data,
and they allow us to project the latency of flash as a function of feature size (f) in nm. The percentages represent the
increase in latency with 1 nm shrinkage. The trends for TLC are less certain than for SLC or MLC, because our data
for TLC devices is more limited.

Table 3: Model’s Equations These equations allow us to scale the metrics of our baseline SSD to future process
technologies and other cell densities.

The SSD’s controller implements the FTL. We esti-
mate that this management layer incurs an overhead of
30 µs for ECC and additional FTL operations. The con-
troller coordinates 24 channels, each of which connects
four dies to the controller via a 400 MB/s bus. To fix the 
cost of SSD-CDC, we assume a constant die count equal 
to 96 dies. 


3.2 Projections 


We now describe our future projections for seven met- 
rics of SSD-CDC: capacity, read latency, write latency, 
read bandwidth, write bandwidth, read IOPs and write 
IOPs. Table 1 provides baseline values for SSD-CDC
and Table 2 summarizes the projections we make for 
the underlying flash technology. This section describes 
the formulas we use to compute each metric from the 
projections (summarized in Table 3). Some of the cal- 
culations involve making simplifying assumptions about 
SSD-CDC’s behavior. In those cases, we make the as- 
sumption that maximizes the SSD’s performance. 


Capacity Equation 1 calculates the capacity of SSD- 
CDC, by scaling the capacity of the baseline by the 
square of the ratio of the projected feature size to the 
baseline feature size (34 nm). We also scale capacity 
depending on the number of bits per cell (BPC) the pro-
jected chip stores relative to the baseline BPC (2, MLC).
In some cases, we configure SSD-CDC to store fewer 
bits per cell than a projected chip allows, as in the case 
of MLC-1. In these cases, the projected capacity would 
reflect the effective bits per cell. 
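
The capacity scaling described in this paragraph can be sketched as below. Equation 1 itself is not reproduced in this text, so the function is an interpretation of the prose: it assumes the feature-size ratio is taken so that smaller features yield more capacity, and the function and constant names are hypothetical.

```python
BASE_CAPACITY_GB = 320     # baseline capacity listed in Table 1
BASE_FEATURE_NM = 34       # baseline feature size
BASE_BPC = 2               # baseline stores two bits per cell (MLC)


def projected_capacity_gb(feature_nm, effective_bpc):
    """Scale baseline capacity by cell area (feature size squared) and bits per cell."""
    area_scale = (BASE_FEATURE_NM / feature_nm) ** 2
    bpc_scale = effective_bpc / BASE_BPC
    return BASE_CAPACITY_GB * area_scale * bpc_scale


baseline = projected_capacity_gb(34, 2)      # reproduces the baseline capacity
mlc1_same_node = projected_capacity_gb(34, 1)  # MLC-1 uses one effective bit per cell
half_feature = projected_capacity_gb(17, 2)  # halving feature size quadruples capacity
```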


Latency To calculate the projected read and write la- 
tencies, we fit an exponential function to the empirical 
data for a given cell type. Figure 4 depicts both the 
raw latency data and the curves fitted to SLC-1, MLC- 
1, MLC-2 and TLC-3. To generate the data for MLC- 
1, which ignores the “slow” pages, we calculate the av- 
erage latency for reads and writes for the “fast” pages 
only. Other configurations supporting reduced capacity 
and improved latency, such as TLC-1 and TLC-2, would 
use a similar method. We do not present these latter con- 
figurations, because there is very little TLC data avail- 
able to create reliable predictions. Figure 4 shows each 
collection of data with the fitted exponentials for average, 
minimum and maximum, and Table 2 reports the equa- 
tions for these fitted trends. We calculate the projected 
latency by adding the values generated by these trends to 
the SSD’s overhead reported in Table 1. 


Figure 5: Scaling of SSD Capacity Flash manufacturers
increase SSDs’ capacity through both reducing feature
size and storing more bits in each cell.


Bandwidth To find the bandwidth of our SSD, we
must first calculate each channel’s bandwidth and then
multiply that by the number of channels in the SSD
(Equation 2). Each channel’s bandwidth requires an un-
derstanding of whether channel bandwidth or per-chip
bandwidth is the bottleneck. Equation 6 determines the
threshold between these two cases by multiplying the
transfer time (see Equation 5) by one less than the num-
ber of dies on the channel. If the latency of the operation
on the die is larger than this number, the die is the bot-
tleneck and we use Equation 3. Otherwise, the channel’s
bandwidth is simply the speed of its bus (Equation 4).
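
The bottleneck test can be sketched as follows. The threshold and the channel-bound case follow the prose above; the die-bound formula is an assumption (each die delivers one page per latency-plus-transfer interval), since the model's equations are not reproduced in this text. The function names and the die latencies in the usage lines are hypothetical.

```python
def channel_bandwidth(bus_mb_s, page_bytes, die_latency_us, dies_per_channel):
    """Return (bandwidth in MB/s, limiting resource) for one channel."""
    transfer_us = page_bytes / bus_mb_s          # bytes / (MB/s) = microseconds
    threshold_us = transfer_us * (dies_per_channel - 1)
    if die_latency_us > threshold_us:
        # Die-bound: assume each die delivers one page per (latency + transfer).
        per_die = page_bytes / (die_latency_us + transfer_us)   # bytes/us = MB/s
        return dies_per_channel * per_die, "die"
    return bus_mb_s, "channel"


def ssd_bandwidth(channels, **kw):
    """Total bandwidth is the per-channel bandwidth times the channel count."""
    bw, _ = channel_bandwidth(**kw)
    return channels * bw


# Hypothetical die latencies, chosen only to exercise both branches.
read_bw, read_limit = channel_bandwidth(400, 4096, die_latency_us=25, dies_per_channel=4)
write_bw, write_limit = channel_bandwidth(400, 4096, die_latency_us=200, dies_per_channel=4)
```

Under the assumed die-bound formula, the two cases agree exactly at the threshold, where the channel delivers one page per transfer time.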


IOPs The calculation for IOPs is very similar to band- 
width, except instead of using the flash’s page size in all 
cases, we also account for the access size since it affects
the transfer time: If the access size is smaller than one 
page, the system still incurs the read or write latency of 
one entire page access. Equations 7-11 describe the cal- 
culations. 


4 Results 


This section explores the performance and cost of SSD- 
CDC in light of the flash feature size scaling trends de- 
scribed above. We explore four different cell technolo- 
gies (SLC-1, MLC-1, MLC-2, and TLC-3) and feature 
sizes scaled down from 72 nm to 6.5 nm (the smallest 
feature size targeted by industry consensus as published 
in the International Technology Roadmap for Semicon- 
ductors (ITRS) [2]), using a fixed silicon budget for flash 
storage.


4.1 Capacity and cost 


Figure 5 shows how SSD-CDC’s density will increase 
as the number of bits per cell rises and feature size con- 
tinues to scale. Even with the optimistic goal of scaling 
flash cells to 6.5 nm, SSD-CDC can only achieve capac- 
ities greater than 512 GB with two or more bits per cell. 
TLC allows for capacities up to 1.4 TB — pushing capac- 
ity beyond this level will require more dies. 

Since capacity is one of the key drivers in SSD design 
and because it is the only aspect of SSDs that improves 
consistently over time, we plot the remainder of the char- 
acteristics against SSD-CDC’s capacity. 

Figure 7: SSD Bandwidth SLC will continue to be the high performance option. To obtain higher capacities without
additional dies and cost will require a significant performance hit in terms of (a) read and (b) write bandwidth moving
from SLC-1 to MLC-2 or TLC-3.

Figure 8: SSD IOPS With a fixed die area, higher capacities can only be achieved with low-performing MLC-2 and
TLC-3 technologies, for 512B (a) reads and (c) writes and for 4kB (b) reads and (d) writes.


4.2 Latency


Reduced latency is among the frequently touted advan- 
tages of flash-based SSDs over disks, but changes in 
flash technology will erode the gap between disks and 
SSDs. Figure 6 shows how both read and write latencies 
increase with SSD-CDC’s capacity. Reaching beyond 
512 GB pushes write latency to 1 ms for MLC-2 and
over 2.1 ms for TLC. Read latency rises to at least 70 µs
for MLC-2 and 100 µs for TLC.

The data also makes clear the choices that SSD de- 
signers will face. Either SSD-CDC’s capacity stops scal- 
ing at ~582 GB or its read and write latency increases 
sharply because increasing drive capacity with fixed die 
area would necessitate switching cell technology from 
SLC-1 or MLC-1 to MLC-2 or TLC-3. With current 
trends, our SSDs could be up to 5.5x larger, but the la-
tency will be 2.1x worse for reads and 5.4x worse for
writes. This will reduce the write latency advantage that
SSDs offer relative to disk from 8.3x (vs. a 7 ms disk
access) to just 3.2x. Depending on the application, this
reduced improvement may not justify the higher cost of 
SSDs. 
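
The advantage figures above are ratios of disk access time to SSD write latency, so they can be inverted to recover the write latencies they imply. The helper name below is hypothetical; the 7 ms disk access time is the one used in the text.

```python
DISK_ACCESS_MS = 7.0   # disk access time used in the text


def implied_ssd_latency_ms(advantage):
    """SSD write latency implied by a given speedup over the 7 ms disk access."""
    return DISK_ACCESS_MS / advantage


today = implied_ssd_latency_ms(8.3)      # about 0.84 ms
projected = implied_ssd_latency_ms(3.2)  # about 2.19 ms
```

The projected value is consistent with the write latency of just over 2.1 ms reported for TLC earlier in this section.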


4.3 Bandwidth and IOPs 


SSDs offer moderate gains in bandwidth relative to disks, 
but very large improvements in random IOP perfor- 
mance. However, increases in operation latency will 
drive down IOPs and bandwidth. 

Figure 7 illustrates the effect on bandwidth. Above 
128 GB or for multi-level technologies, bandwidth drops 
by 25% due to the latency of the program operation on 
the flash die. 

SSDs provide the largest gains relative to disks for 
small, random IOPs. We present two access sizes — the 
historically standard disk block size of 512 B and the 
most common flash page size and modern disk access 
size of 4 kB. Figure 8 presents the performance in terms 
of IOPs. When using the smaller, unaligned 512B ac- 
cesses, SLC and MLC chips must access 4 kB of data 
and the SSD must discard 88% of the accessed data. For 
TLC, there is even more wasted bandwidth because the page
size is 8 kB.
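
The discarded fraction is simply the portion of the page that the access does not use. The helper name below is hypothetical.

```python
def discarded_fraction(access_bytes, page_bytes):
    """Fraction of a page read that is thrown away for a sub-page access."""
    return 1 - access_bytes / page_bytes


four_kb_page = discarded_fraction(512, 4096)    # 0.875, the "88%" in the text
eight_kb_page = discarded_fraction(512, 8192)   # 0.9375
```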

When using 4kB accesses, MLC IOPs drop as density 
increases, falling by 18% between the 64 and 1024 GB 
configurations. Despite this drop, the data suggest that 
SSDs will maintain an enormous (but slowly shrinking) 
advantage relative to disk in terms of IOPs. Even the 
fastest hard drives can sustain no more than 200 IOPs, 
and the slowest SSD configuration we consider achieves 
over 32,000 IOPs. 

Figure 9: Scaling of all parameters While the cost of an
MLC-based SSD remains roughly constant, read and
particularly write performance decline.

Figure 9 shows all parameters for an SSD made from
MLC-2 flash normalized to SSD-CDC configured with
currently available flash. Our projections show that the
cost of the flash in SSD-CDC will remain roughly con- 
stant and that density will continue to increase (as long as 
flash scaling continues as projected by the ITRS). How- 
ever, they also show that access latencies will increase 
by 26% and that bandwidth (in both MB/s and IOPS) 
will drop by 21%. 


5 Conclusion 


The technology trends we have described put SSDs in 
an unusual position for a cutting-edge technology: SSDs 
will continue to improve by some metrics (notably den- 
sity and cost per bit), but everything else about them 
is poised to get worse. This makes the future of SSDs 
cloudy: While the growing capacity of SSDs and high 
IOP rates will make them attractive in many applications, 
the reduction in performance that is necessary to increase 
capacity while keeping costs in check may make it dif- 
ficult for SSDs to scale as a viable technology for some 
applications. 


Abstract


In a traditional block I/O path, the operating system com- 
pletes virtually all I/Os asynchronously via interrupts. 
However, when performing storage I/O with ultra-low latency
devices using next-generation non-volatile memory, it 
can be shown that polling for the completion — hence 
wasting clock cycles during the I/O — delivers higher 
performance than traditional interrupt-driven I/O. This 
paper thus argues for the synchronous completion of 
block I/O first by presenting strong empirical evidence 
showing a stack latency advantage, second by delineating 
limits with the current interrupt-driven path, and third by 
proving that synchronous completion is indeed safe and 
correct. This paper further discusses challenges and op-
portunities introduced by the synchronous I/O completion
model for both operating system kernels and user appli-
cations.


1 Introduction 


When an operating system kernel processes a block sto- 
rage I/O request, the kernel usually submits and com- 
pletes the I/O request asynchronously, releasing the CPU 
to perform other tasks while the hardware device com- 
pletes the storage operation. In addition to the CPU 
cycles saved, the asynchrony provides opportunities to 
reorder and merge multiple I/O requests to better match 
the characteristics of the backing device and achieve 
higher performance. Indeed, this asynchronous I/O strat- 
egy has worked well for traditional rotating devices and 
even for NAND-based solid-state drives (SSDs). 


Future SSD devices may well utilize high-performance 
next-generation non-volatile memory (NVM), calling for 
a re-examination of the traditional asynchronous comple-
tion model. The high performance of such devices both
diminishes the CPU cycles saved by asynchrony and re-
duces the I/O scheduling advantage.


This paper thus argues for the synchronous I/O comple- 
tion model by which the kernel path handling an I/O re- 
quest stays within the process context that initiated the 
I/O. Synchronous completion allows I/O requests to by- 


pass the kernel’s heavyweight asynchronous block I/O 
subsystem, reducing CPU clock cycles needed to process 
I/Os. However, a necessary condition is that the CPU has 
to spin-wait for the completion from the device, increas- 
ing the cycles used. 


Using a prototype DRAM-based storage device to mimic 
the potential performance of a very fast next-generation 
SSD, we verified that the synchronous model completes 
an individual I/O faster and consumes fewer CPU clock
cycles despite having to poll. The device is fast enough 
that the spinning time is smaller than the overhead of the 
asynchronous I/O completion model. 
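
The break-even condition can be expressed with a simple cost model. The sketch below is illustrative only: the decomposition of asynchronous overhead into block-layer, interrupt, and context-switch costs is an assumption, and the microsecond values are placeholders rather than measurements from our prototype.

```python
def sync_cpu_us(submit_us, device_us):
    """CPU time burned by synchronous completion: submit, then spin until done."""
    return submit_us + device_us


def async_cpu_us(submit_us, block_layer_us, interrupt_us, context_switch_us):
    """CPU time spent by interrupt-driven completion; the CPU is free while the device works."""
    return submit_us + block_layer_us + interrupt_us + 2 * context_switch_us


def polling_wins(device_us, submit_us, block_layer_us, interrupt_us, context_switch_us):
    """True when spinning for the device costs less CPU than the asynchronous path."""
    return sync_cpu_us(submit_us, device_us) < async_cpu_us(
        submit_us, block_layer_us, interrupt_us, context_switch_us)


# Placeholder costs in microseconds; not measured values.
costs = dict(submit_us=1.0, block_layer_us=2.0, interrupt_us=1.5, context_switch_us=1.0)
fast_device = polling_wins(device_us=2.0, **costs)     # spin time below async overhead
slow_device = polling_wins(device_us=100.0, **costs)   # spin time dominates
```

In this model polling pays off only while the device latency stays below the fixed overhead of the asynchronous path, which is the regime the prototype device occupies.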


Interrupt-driven asynchronous completion introduces 
additional performance issues when used with very fast 
SSDs such as our prototype. Asynchronous completion 
may suffer from lower I/O rates even when scaled to 
many outstanding I/Os across many threads. We empiri-
cally confirmed this with Linux,* and examine the sys-
tem overheads of interrupt handling, cache pollution, and
CPU power-state transitions associated with the asyn-
chronous model.


We also demonstrate that the synchronous completion 
model is correct and simple with respect to maintaining 
I/O ordering when used with application interfaces such 
as non-blocking I/O and multithreading. 


We suggest that current applications may further benefit 
from the synchronous model by avoiding the non- 
blocking storage I/O interface and by reassessing buffer- 
ing strategies such as I/O prefetching. We conclude that 
with future SSDs built of next-generation NVM ele- 
ments, introducing the synchronous completion model 
could reap significant performance benefits. 


2 Background 


The commercial success of SSDs coupled with reported 
advancements of NVM technology is significantly reduc- 
ing the performance gap between mass-storage and 
memory [15]. Experimental storage devices that complete
an I/O within a few microseconds have been demonstrated
[8]. One of the implications of this trend is that
the once negligible cost of I/O stack time becomes more
relevant [8,12]. Another important trend in operating with 
SSDs is that big, sequential, batched I/O requests need no 
longer be favored over small, random I/O requests [17]. 


In the traditional block I/O architecture, the operating 
system’s block I/O subsystem performs the task of sche- 
duling I/O requests and forwarding them to block device 
drivers. This subsystem processes kernel I/O requests 
specifying the starting disk sector, target memory ad- 
dress, and size of I/O transfer, and originating from a file 
system, page cache, or user application using direct I/O. 
The block I/O subsystem schedules kernel I/O requests 
by queueing them in a kernel I/O queue and placing the 
I/O-issuing thread in an I/O wait state. The queued re- 
quests are later forwarded to a low-level block device 
driver, which translates the requests into device I/O com- 
mands specific to the backing storage device. 


Upon finishing an I/O command, a storage device is ex- 
pected to raise a hardware interrupt to inform the device 
driver of the completion of a previously submitted com- 
mand. The device driver’s interrupt service routine then 
notifies the block I/O subsystem, which subsequently 
ends the kernel I/O request by releasing the target memo- 
ry and un-blocking the thread waiting on the completion 
of the request. A storage device may handle multiple 
device commands concurrently using its own device 
queue [2,5,6], and may combine multiple completion
interrupts, a technique called interrupt coalescing, to
reduce overhead.


As described, the traditional block I/O subsystem uses
asynchrony within the I/O path to save CPU cycles for 
other tasks while the storage device handles I/O com- 
mands. Also, using I/O schedulers, the kernel can reorder 
or combine multiple outstanding kernel I/O requests to 
better utilize the underlying storage media. 


This description of the traditional block storage path cap- 
tures what we will refer to as the asynchronous I/O com- 
pletion model. In this model, the kernel submits a device 
I/O command in a context distinct from the context of the 
process that originated the I/O. The hardware interrupt 
generated by the device upon command completion is 
also handled, at first, by a separate kernel context. The 
original process is later awakened to resume its execu- 
tion. 


A block I/O subsystem typically provides a set of in-kernel
interfaces for device drivers to use. In Linux, a block
device driver is expected to implement a ‘request_fn’
callback that the kernel calls while executing in an interrupt
context [7,10]. Linux provides another callback point
called ‘make_request’, which is intended to be used by
pseudo block devices, such as a ramdisk. The latter callback
differs from the former in that it is positioned
at the highest point in Linux’s block I/O subsystem
and is called within the context of the process thread.


3 Synchronous I/O completion model 


When we say a process completes an I/O synchronously, 
we mean the kernel’s entire path handling an I/O request 
stays within the process context that initiated the I/O. A 
necessary condition for this synchronous I/O completion 
is that the CPU poll the device for completion. This pol- 
ling must be realized by a spin loop, busy-waiting the 
CPU while waiting for the completion. 
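
The spin-loop idea can be illustrated with a small user-level simulation. This is only a sketch under stated assumptions: `FakeDevice` and `sync_read` are hypothetical names, a background thread stands in for the hardware, and a Python loop stands in for the driver's busy-wait. It is not the driver code used in the paper.

```python
import threading
import time


class FakeDevice:
    """Hypothetical stand-in for a fast storage device (not a real driver API)."""

    def __init__(self, service_time_s):
        self.service_time_s = service_time_s
        self.done = False   # completion flag that the caller polls
        self.data = None

    def submit(self, payload):
        # The "hardware" works in the background and sets the flag when finished.
        def work():
            time.sleep(self.service_time_s)
            self.data = payload
            self.done = True

        threading.Thread(target=work).start()


def sync_read(device, payload):
    """Submit one command, then busy-wait in the caller's own context."""
    device.submit(payload)
    spins = 0
    while not device.done:  # spin loop: no blocking wait, no interrupt handler
        spins += 1
    return device.data, spins


data, spins = sync_read(FakeDevice(0.001), b"block-0")
```

The caller never leaves its own context between submission and completion; the price is the CPU time burned in the loop, which `spins` makes visible.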


Compared to the traditional asynchronous model, syn- 
chronous completion can reduce CPU clock cycles 
needed for a kernel to process an I/O request. This reduc- 
tion comes primarily from a shortened kernel path and 
from the removal of interrupt handling, but synchronous
completion brings with it extra clock cycles spent in
polling. In this section, we make the case for
synchronous completion by quantifying these overheads. We
then discuss problems with the asynchronous model and
argue the correctness of the synchronous model.


3.1 Prototype hardware and device driver 


For our measurements, we used a DRAM-based proto- 
type block storage device connected to the system with 
an early prototype of an NVM Express* [5] interface to 
serve as a model of a fast future SSD based on next-generation
NVM. The device was directly attached to a
PCIe* Gen2 bus with eight lanes and with a device-based
DMA engine handling data transfers. As described by the
NVM Express specification, the device communicates
with the device driver via segments of main memory, 
through which the device receives commands and places 
completions. The device can instantiate multiple device 
queues and can be configured to generate hardware inter- 
rupts upon command completion. 


I/O completion method            | 512B xfer | 4KiB xfer
Polling, Gen2 bus                | 1.5 µs    | 2.9 µs
Interrupt, 8Gbps bus projection  |           |
Polling, 8Gbps bus projection    |           |


Table 1. Time to finish an I/O command, excluding software
time, measured for our prototype device. The numbers measure
random-read performance with device queue depth of 1.


Table 1 shows performance statistics for the prototype 
device. The ‘C-state’ refers to the latency when the CPU 
enters power-saving mode while the I/O is outstanding. 
The performance measured is limited by prototype
throughput, not by anything fundamental; future SSDs
may well feature higher throughputs. The improved performance
projection assumes a higher-throughput SSD
on a saturated PCIe Gen3 bus (8Gbps).


We wrote a Linux device driver for the prototype hard- 
ware supporting both asynchronous and synchronous 
completion models. For the asynchronous model, the
driver implements Linux’s ‘request_fn’ callback, thus
taking the traditional path of using the stock kernel I/O
queue. In this model, the driver uses a hardware interrupt.
The driver executes within the interrupt context for both
the I/O request submission and the completion. For the
synchronous model, the driver implements Linux’s
‘make_request’ callback, bypassing most of Linux’s
block I/O infrastructure. In this model, the driver polls for
completion from the device and hence executes within the
context of the thread that issued the I/O.


For this study, we assume that hardware never triggers 
internal events that incur substantially longer latency than 
average. We expect that such events are rare and can be
easily dealt with by having the operating system fall back to
the traditional asynchronous model.


3.2 Experimental setup and methodology 


We used 64-bit Fedora* 13 running a 2.6.33 kernel on an
x86 dual-socket server with 12GiB of main memory.
Each processor socket was populated with a quad-core
2.93GHz Intel® Xeon® with 8MiB of shared L3 cache
and 256KiB of per-core L2 cache. Intel® Hyper-Threading
Technology was enabled, totaling 16 architectural
CPUs available to software. CPU frequency scaling
was disabled.


For measurements we used a combination of the CPU 
timestamp counter and reports from user-level programs. 
Upon events of interest in the kernel, the device driver executed
the ‘rdtsc’ instruction to read the CPU timestamp
counter, whose values were later processed offline to
produce kernel path latencies. For application IOPS (I/O
Operations Per Second) and I/O system call completion
latency, we used the numbers reported by the ‘fio’ [1] I/O
micro-benchmark running in user mode.
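
The offline processing of timestamp-counter readings can be sketched as follows. This is a minimal illustration, assuming the counter ticks at the nominal 2.93GHz clock (frequency scaling was disabled in this setup); the function name and the sample tick values are made up for illustration and are not measurements from the paper.

```python
# Assumption: the timestamp counter advances at the nominal 2.93 GHz clock rate.
TSC_HZ = 2.93e9


def tsc_delta_to_us(start_ticks, end_ticks, tsc_hz=TSC_HZ):
    """Convert a pair of timestamp-counter readings to elapsed microseconds."""
    return (end_ticks - start_ticks) / tsc_hz * 1e6


# Illustrative (start, end) readings logged around two kernel I/O paths.
samples = [(1_000_000, 1_012_833), (2_000_000, 2_022_268)]
latencies_us = [tsc_delta_to_us(start, end) for start, end in samples]
```

With these sample values the two intervals come out at roughly 4.38µs and 7.6µs, which is the scale of the latencies discussed in the next section.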


We bypassed the file system and the buffer cache to isolate
the cost of the block I/O subsystem. Note that our
objective is to measure the difference between the two
completion models when exercising the back-end block
I/O subsystem. The performance of that subsystem is not
changed by the use of the file system or the buffer cache,
whose cost would thus be additive to either completion
model. The kernel was
compiled with -O3 optimization and kernel preemption
was enabled. The I/O scheduler was disabled for the
asynchronous path by selecting the ‘noop’ scheduler in order
to make the asynchronous path as fast as possible.


3.3 Storage stack latency comparison 


Our measurements answer the following questions:


• How fast does each completion path complete application
I/O requests?


• How much CPU time is spent by the kernel in each
completion model?


• How much CPU time is available to another user
process scheduled in during an asynchronous I/O?


[Figure 1: bar chart of I/O completion latency in µs for 4KiB and 512B transfers under Async (C-state), Async, and Sync completion; each bar is split into hardware device and operating system components.]


Figure 1. Storage stack block I/O subsystem cost comparison.
Each bar measures application-observed I/O completion latency,
which is broken into device hardware latency and non-overlapping
operating system latency. Error bars represent +/-
one standard deviation.


Figure 1 shows that the synchronous model completes an
I/O faster than the asynchronous path in terms of absolute
latency. The figure shows actual measured latency for the
user application performing 4KiB and 512B random
reads. For our fast prototype storage device, the CPU
spin-wait cost in the synchronous path is lower than the
code-path reduction achieved by the synchronous path,
completing a 4KiB I/O synchronously in 4.4µs versus
7.6µs for the asynchronous case. The figure breaks the
latency into hardware time and kernel time that does not
overlap with hardware time. The hardware time for the
asynchronous path is slightly greater than that of the synchronous
path due to interrupt delivery latency.


Figure 2 details the latency component breakdown of the 
asynchronous kernel path. In the figure, Tu indicates the 
CPU time actually available to another user process dur- 
ing the time slot vacated during asynchronous path I/O 
completion. To measure this time as accurately as possi- 
ble, we implemented a separate user-level program sche- 
duled to run on the same CPU as the I/O benchmark. 
This program continuously checked CPU timestamps to 
detect its scheduled period at a sub-microsecond granularity.
Using this program, we measured Tu to be 2.7µs
for a 4KiB transfer that the device takes 4.1µs to finish.


The conclusion of the stack latency measurements is a
strong one: the synchronous path completes I/Os faster
and uses the CPU more efficiently. This is true despite
spin-waiting for the duration of the I/O because the work
the CPU performs in the asynchronous path (i.e., Ta + Tb =
6.3µs) is greater than the spin-waiting time of the synchronous
path (4.38µs) with this fast prototype SSD. For
smaller-sized transfers, synchronous completion by polling
wins over asynchronous completion by an even
greater margin.


Figure 2. Latency component breakdown of asynchronous kernel
path. Ta (= Ta’ + Ta’’) indicates the cost of kernel path that
does not overlap with Td, which is the interval during which the
device is active. Scheduling a user process P2 during the I/O
interval incurs kernel scheduling cost, which is Tb. The CPU
time available for P2 to make progress is Tu. For a 4KiB transfer,
Ta, Td, Tb, and Tu measure 4.9, 4.1, 1.4 and 2.7µs, respectively.


With the synchronous completion model, improvement
in hardware latency directly translates to improvement in
software stack overhead. However, the same does not
hold for the asynchronous model. For instance, using
projected PCIe Gen3 bus performance, the spin-wait time
is expected to be reduced from the current 2.9µs to 1.5µs,
making the synchronous path time 3.0µs, while the
asynchronous path overhead remains the same at 6.3µs.
Of course, the converse is also true: slow SSDs will be
felt by the synchronous model, but not by the asynchronous
model. Clearly, these results are most relevant for
very low latency NVM.
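
The arithmetic behind this projection can be reproduced with a short sketch. It assumes, as the text implies, that the software portion of the synchronous path stays constant while only the spin-wait time shrinks with faster hardware; the constant and function names are ours.

```python
# Measured values for a 4KiB transfer, taken from the text and Figure 2 (in µs).
TA_US = 4.9           # async kernel path that does not overlap device time
TB_US = 1.4           # cost of scheduling another process during the I/O
SYNC_TOTAL_US = 4.38  # whole synchronous path, including spin-wait
SYNC_SPIN_US = 2.9    # spin-wait portion on the Gen2 bus

ASYNC_CPU_WORK_US = TA_US + TB_US                # CPU work per async I/O
SYNC_SOFTWARE_US = SYNC_TOTAL_US - SYNC_SPIN_US  # assumed constant software cost


def sync_path_us(spin_wait_us):
    """Synchronous path time for a given hardware (spin-wait) latency."""
    return SYNC_SOFTWARE_US + spin_wait_us


gen2_us = sync_path_us(2.9)  # today's prototype
gen3_us = sync_path_us(1.5)  # projected 8Gbps bus
```

Under this assumption `gen3_us` evaluates to about 2.98µs, matching the 3.0µs quoted above, while `ASYNC_CPU_WORK_US` stays at 6.3µs no matter how fast the device becomes.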


This measurement study also sets a lower bound on the 
SSD latency for which the asynchronous completion 
model recovers absolutely no useful time for other 
processes: 1.4µs (Tb in Figure 2).
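
One way to read this bound is through a simple model in which the time left for another process is the device time minus the scheduling cost, floored at zero. The model is our assumption, not a formula given in the paper, but it is consistent with the measured values in Figure 2 (Td = 4.1µs, Tb = 1.4µs, Tu = 2.7µs).

```python
TB_US = 1.4  # kernel scheduling cost Tb from Figure 2, in µs


def useful_time_us(device_time_us, sched_cost_us=TB_US):
    """Assumed model: Tu = max(0, Td - Tb)."""
    return max(0.0, device_time_us - sched_cost_us)


tu_prototype = useful_time_us(4.1)  # measured device time for a 4KiB transfer
tu_at_bound = useful_time_us(1.4)   # device exactly as fast as the scheduling cost
```

At or below a device latency of 1.4µs the model returns zero: the asynchronous path would pay for a context switch and hand nothing useful to the other process.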


3.4 Further issues with interrupt-driven I/O


The increased stack efficiency gained with the synchron- 
ous model for low latency storage devices does not just 
result in lower latency, but also in higher IOPS. Figure 3 
shows the IOPS scaling for an increasing number of CPUs
performing 512B randomly addressed reads. For this test,
both the synchronous and asynchronous models use
100% of each included CPU. The synchronous model
does so with just a single thread per CPU, while the
asynchronous model required up to 8 threads per CPU to
achieve maximum IOPS. In the asynchronous model, the
total number of threads needed increases with the number of
processors to compensate for the larger per-I/O latency.


The synchronous model shows the best per-CPU I/O
performance, scaling linearly with the increased number
of CPUs up to 2 million IOPS, the hardware limitation
of our prototype device. Even with its larger number of
threads per CPU, the asynchronous model displays a
significantly lower I/O rate, achieving only 60-70% of
that of the synchronous model. This lower I/O rate is a result of
inefficiencies inherent in the use of the asynchronous
model when accessing such a low latency storage device.
We discuss these inefficiencies in the following sections.
It should be noted that this discussion is correct only for a
very low latency storage device, like the one used here:
traditional higher latency storage devices gain compelling
efficiencies from the use of the asynchronous model.


[Figure 3: plot of IOPS versus number of CPUs.]


Figure 3. Scaling of storage I/Os per second (IOPS) with increased
number of CPUs. For asynchronous IOPS, I/O threads
are added until the utilization of each CPU reaches 100%.


Interrupt overhead 


The asynchronous model necessarily includes generation
and service of an interrupt. This interrupt brings with it
extra, otherwise unnecessary work, increasing CPU utilization
and therefore decreasing I/O rate on a fully loaded
system. Another problem is that the kernel processes
hardware interrupts at high priority. Our prototype device
can deliver hundreds of thousands of interrupts per second.
Even if the asynchronous model driver completes multiple
outstanding I/Os during a single hardware interrupt
invocation, the device generates interrupts fast enough to
saturate the system and cause user-noticeable delays.
Further, while coalescing interrupts reduces CPU utilization
overhead, it also increases completion latencies for
individual I/Os.


Cache and TLB pollution 


The short I/O-wait period in the asynchronous model can
cause a degenerative task schedule, polluting hardware
caches and TLBs. This is because the default task scheduler
eagerly finds any runnable thread to fill in the slot
vacated by an I/O. With our prototype, the available time
for a scheduled-in thread is only 2.7µs, which equals roughly 8000
CPU clock cycles. If the scheduled thread has lower priority
than the original thread, the original thread will likely
be re-scheduled upon the completion of the I/O: lots of
state swapping for little work done. Worse, thread data
held in hardware resources such as memory caches and
TLBs is replaced, only to be re-populated when
the original thread is scheduled back.


CPU power-state complications


Power management used in conjunction with the asyn- 
chronous model for the short I/O-wait of our device may 
not only reduce the power saving, but also increase I/O 
completion latency. A modern processor may enter a
power-saving ‘C-state’ when not loaded or lightly loaded.
Transitions among C-states incur latency. For the asynchronous
model, the CPU enters a power-saving C-state
when the scheduler fails to find a thread to run after
sending an I/O command. The synchronous model does
not automatically allow this transition to a lower C-state
since the processor is busy.


We have measured a latency impact from C-state transitions.
When the processor enters a C-state, the asynchronous
path takes an additional 2µs in observed hardware
latency with higher variability (Figure 1, labeled
‘async C-state’). This additional latency is incurred only
when the system has no other thread to schedule on the
CPU. The end result is that a thread performing I/Os runs
slower when it is the only thread active on the CPU; we
confirmed this empirically.


It is hard for an asynchronous model driver to fine-tune 
C-state transitions. In the asynchronous path, the C-state
transition decision is primarily made by the operating system’s
CPU scheduler or by the processor hardware itself.
On the other hand, a device driver using synchronous 
completion can directly construct its spin-wait loop using 
instructions with power-state hints, such as mwait [3], 
better controlling C-state transitions. 


3.5 Correctness of synchronous model 


A block I/O subsystem is deemed correct when it pre- 
serves ordering requirements for I/O requests made by its 
frontend clients. Ultimately, we want to address the fol- 
lowing problem: 


A client performs I/O calls ‘A’ and ‘B’ in order, and 
its ordering requirement is that B should get to the 
device after A. Does the synchronous model respect this
requirement? 


For brevity, we assume the client to be a user application
using Linux I/O system calls. We also assume that the
file system and the page cache are bypassed. In fact, the file
system and page cache themselves can be considered
frontend clients using the block I/O subsystem.


We start with two assumptions: 
A1. Application uses blocking I/O system calls.
A2. Application is single threaded. 


Let us consider a single thread submitting A and B in
order. The operating system may preempt and schedule
the thread on a different CPU, but this does not affect the
ordering of I/O requests since there is only a single thread
of execution. Therefore, it is guaranteed that B reaches
the device after A.


Let us relax A1. The application order requires the thread
to submit A before B using a non-blocking interface or AIO
[4]. With the synchronous model, this means that the
device has already completed the I/O for A at the moment
that the application makes another non-blocking system
call for B. Therefore, the synchronous model guarantees
that B reaches the device after A with a non-blocking I/O
interface.


Relaxing A2, let us imagine two threads T1 and T2,
performing A and B respectively. In order to respect the
application’s ordering requirement, T2 must synchronize
with T1 to avoid a race, in such a way that T2 waits
for T1 before submitting B. The end result is that the kernel
always sees B after the kernel safely completes the previously
submitted A. Therefore, the synchronous model guarantees
the ordering with multi-threaded applications.


The above exercise shows that an I/O barrier is unneces- 
sary in the synchronous model to guarantee I/O ordering. 
This contrasts with the asynchronous model, where a program
has to rely on an I/O barrier when it needs to force
ordering. Hence, the synchronous model has the potential to
further simplify storage I/O routines with respect to guaranteeing
data durability and consistency.
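
The multi-threaded case can be illustrated with a user-level sketch. The names `sync_submit`, `device_log`, and the use of a `threading.Event` for application-level synchronization are hypothetical; the only property modeled is that a synchronous submission returns after the device has received the request.

```python
import threading

device_log = []              # order in which requests reach the hypothetical device
log_lock = threading.Lock()
a_done = threading.Event()   # application-level synchronization between T1 and T2


def sync_submit(request):
    """Hypothetical synchronous submit: returns only after the device has the request."""
    with log_lock:
        device_log.append(request)


def t1():
    sync_submit("A")
    a_done.set()             # A is complete by the time this runs


def t2():
    a_done.wait()            # T2 waits for T1 before submitting B
    sync_submit("B")


# Start T2 first on purpose; the ordering must still hold.
threads = [threading.Thread(target=t2), threading.Thread(target=t1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because `sync_submit` has finished before `a_done` is set, the application's own synchronization is enough to order the requests and no separate I/O barrier appears anywhere in the sketch.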


Our synchronous device driver written for Linux has 
been tested with multi-threaded applications using non-blocking
system calls. For instance, the driver has withstood
many hours of TPC-C* benchmark runs. The driver
has also been heavily utilized as a system swap space. 
We believe that the synchronous completion model is 
correct and fully compatible with existing applications. 


4 Discussion 


The asynchronous model may work better in processing 
I/O requests with large transfer sizes or handling hard- 
ware stalls that cause long latencies. Hence, a favorable 
solution would be a synchronous and asynchronous hybr- 
id, where there are two kernel paths for a block device: 
the synchronous path is the fast path for small transfers 
and often used, whereas the asynchronous path is the 
slow fallback path for large transfers or hardware stalls. 
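
A dispatch policy for such a hybrid could look like the sketch below. The 4KiB cutoff, the `device_stalled` flag, and the function name are illustrative assumptions; the paper does not specify a threshold or a stall-detection mechanism.

```python
# Hypothetical cutoff: transfers up to this size take the synchronous fast path.
SYNC_MAX_BYTES = 4096


def choose_path(transfer_bytes, device_stalled=False, sync_max_bytes=SYNC_MAX_BYTES):
    """Pick the kernel path for one request under the assumed hybrid policy."""
    if device_stalled or transfer_bytes > sync_max_bytes:
        return "async"  # slow fallback: large transfer or hardware stall
    return "sync"       # fast path: small transfer on a responsive device


small = choose_path(512)
large = choose_path(1 << 20)
stalled = choose_path(512, device_stalled=True)
```

In a real system the cutoff would need to be tuned against the device, since the break-even point depends on how long the CPU would otherwise spin.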


We believe that existing applications have primarily as- 
sumed the asynchronous completion model and tradition- 
al slow storage devices. Although the synchronous com- 
pletion model requires little change to existing software 
to run correctly, some changes to the operating system 
and to applications will allow for faster, more efficient 
system operation when storage is used synchronously. 
We did not attempt to re-write applications, but do suggest
possible software changes.


Perhaps the most significant improvement that could be
achieved for I/O intensive applications is to avoid using 
the non-blocking user I/O interface such as AIO calls 
when addressing a storage device synchronously. In this 
case, using the non-blocking interface adds overhead and 
complexity to the application without benefit because
the operating system already completes the I/O upon the
return from a non-blocking I/O submission call. Although
applications that use the non-blocking interface
are functionally safe and correct with synchronous completion,
the use of the non-blocking interface negates the
latency and scalability gains achievable in the kernel with the
synchronous completion model.


When the backing storage device is fast enough to com- 
plete an I/O synchronously, applications that have tradi- 
tionally self-managed I/O buffers must reevaluate their 
buffering strategy. We observe that many I/O intensive 
applications existing today, such as databases, the operat- 
ing system’s page cache, and disk-swap algorithms, em- 
ploy elaborate I/O buffering and prefetching schemes. 
Such custom I/O schemes may add overhead with little 
value for the synchronous completion model. Although 
our work in the synchronous model greatly simplifies I/O 
processing overhead in the kernel, application complexity 
may still become a bottleneck. For instance, I/O prefetching
becomes far less effective and could even hurt performance.
We have found the performance of the page cache
and disk swapper to increase when we disabled page
cache read-ahead and swap-in clustering.


Informing applications of the presence of synchronous 
completions is therefore necessary. For example, an 
ioctl() extension to query the underlying completion model
should help applications decide the best I/O strategy.
Operating system processor usage statistics must account
separately for the time spent in the driver’s spin-wait
loop. Currently there is no accepted method of accounting
for these ‘spinning I/O wait’ cycles. In our prototype
implementation, the time spent in the polling loop is 
simply accounted towards system time. This may mislead 
people to believe no I/O has been performed or to suspect 
kernel inefficiency due to increased system time. 


5 Related work


Following the success of NAND-based storage, research 
interest has surged in next-generation non-volatile
memory (NVM) elements [11,14,16,19]. Although base 
materials differ, these memory elements commonly 
promise faster and simpler media access than NAND. 


Because of the DRAM-like random accessibility of many 
next-generation NVM technologies, there is abundant 
research in storage-class memories (SCM), where NVM 
is directly exposed as a physical address space. For in- 
stance, file systems have been proposed on SCM-based 
architectures [9,21]. In contrast, we approach next-generation
NVM in a more evolutionary way, preserving
the current hardware and software storage interface, in
keeping with the huge body of existing applications.


FAST ’12: 10th USENIX Conference on File and Storage Technologies


Moneta [8] is a recent effort to evaluate the design and 
impact of next-generation NVM-based SSDs. Moneta 
hardware is akin to our prototype device in spirit because 
it is a block device connected via a PCIe bus. But implementation
differences enabled our hardware to perform
faster than Moneta. Moneta also examined spinning to
cut the kernel cost, but its description is limited to the latency
aspect. In contrast, this paper studied issues relevant to
the viability of synchronous completion, such as IOPS 
scalability, interrupt thrashing, power state, etc. 


Interrupt-driven asynchronous completion has long been 
the only I/O model used by the kernel to perform real storage
I/Os. Storage interface standards have thus embraced
hardware queueing techniques that further improve performance
of asynchronous I/O operations [2,5,6]. However,
these are mostly effective for devices with a
slower storage medium such as hard disk or NAND flash.


It is a well-known strategy to choose a poll-based waiting 
primitive over an event-based one when the waiting time 
is short. A spinlock, for example, is preferred to a system
mutex lock if the duration for which the lock is held is short.
Another example is the optional use of polling [18,20] for 
network message passing among nodes when implement- 
ing the MPI* library [13] used in high-performance com- 
puting clusters. In such systems communication latencies 
among nodes are just several microseconds due to the use 
of low-latency, high-bandwidth communication fabric 
along with a highly optimized network stack such as Re- 
mote Direct Memory Access (RDMA*). 


6 Conclusion 


This paper makes the case for the synchronous comple- 
tion of storage I/Os. When performing storage I/O with 
ultra-low latency devices employing next-generation 
non-volatile memories, polling for completion performs 
better than the traditional interrupt-driven asynchronous 
I/O path. Our conclusion has practical importance,
pointing to the need for kernel researchers to consider
optimizations to the traditional kernel block storage interface
with next-generation SSDs, built of next-generation
NVM elements, in mind. It is our belief that non-dramatic
changes can reap significant benefit.


Abstract


Data-protection class workloads, including backup 
and long-term retention of data, have seen a strong in- 
dustry shift from tape-based platforms to disk-based sys- 
tems. But the latter are traditionally designed to serve 
as primary storage and there has been little published 
analysis of the characteristics of backup workloads as 
they relate to the design of disk-based systems. In this 
paper, we present a comprehensive characterization of 
backup workloads by analyzing statistics and content 
metadata collected from a large set of EMC Data Domain 
backup systems in production use. This analysis is both 
broad (encompassing statistics from over 10,000 sys- 
tems) and deep (using detailed metadata traces from sev- 
eral production systems storing almost 700TB of backup 
data). We compare these systems to a detailed study of 
Microsoft primary storage systems [22], showing that 
backup storage differs significantly from their primary 
storage workload in the amount of data churn and ca- 
pacity requirements as well as the amount of redundancy 
within the data. These properties bring unique challenges 
and opportunities when designing a disk-based filesys- 
tem for backup workloads, which we explore in more 
detail using the metadata traces. In particular, the need 
to handle high churn while leveraging high data redun- 
dancy is considered by looking at deduplication unit size 
and caching efficiency. 


1 Introduction 


Characterizing and understanding filesystem content and 
workloads is imperative for the design and implementa- 
tion of effective storage systems. There have been nu- 
merous studies over the past 30 years of file system char- 
acteristics for general-purpose applications [1, 2, 3, 9, 15, 
20, 22, 26, 30, 31], but there has been little in the way of 
corresponding studies for backup systems. 

Data backups are used to protect primary data. They 
might typically consist of a full copy of the primary data
once per week (i.e., a weekly full), plus a daily backup
of the files modified since the previous backup (i.e., a 
daily incremental). Historically, backup data has been 
written to tape in order to leverage tape’s low cost per 
gigabyte and allow easy transportation off site for disaster
recovery. In the late 1990s, virtual tape (or “VTL”)
was introduced, which used hard disk storage to mimic 
a tape library. This allowed for consolidation of storage 
and faster restore times. Beginning in the early 2000s, 
deduplicating storage systems [10, 34] were developed, 
which removed data redundancy and extended the bene- 
fits of disk-based backup storage by lowering the cost of 
storage and making it more efficient to copy data off-site 
over a network for disaster recovery (replication). 


The transition from tape to VTL and deduplicating 
disk-based storage has seen a strong adoption by the 
industry. In 2010 purpose-built backup appliances pro- 
tected 468PB and are projected to protect 8EB by 2015, 
by which time this will represent a $3.5B market [16]. 
This trend has made a detailed study of backup filesys- 
tem characteristics pertinent for system designers if not 
long overdue. 


In this paper we first analyze statistics from a broad set 
of 10,000+ production EMC Data Domain systems [12]. 
We also collect and analyze content-level snapshots of 
systems that, in aggregate, are to our knowledge at least 
an order of magnitude larger than anything previously 
reported. Our statistical analysis considers information 
such as file age, size, counts, deduplication effective- 
ness, compressibility, and other metrics. Comparing this 
to Meyer and Bolosky’s analysis of a large collection 
of systems in Microsoft Corp. [22], we see that backup 
workloads tend to have shorter-lived and larger files than 
primary storage. This is indicative of higher data churn 
rates, a measure of the percentage of storage capacity that 
is written and deleted per time interval (e.g., weekly), as 
well as more data sequentiality. These have implications 
for the design requirements for purpose-built backup systems.

While summary statistics are useful for analyzing
overall trends, we need more detailed information to con- 
sider topics such as performance analysis (e.g., cache 
hit rates) or assessing the effect of changes to system 
configurations (e.g., varying the unit of deduplication). 
We address this with our second experimental methodol- 
ogy, using simulation from snapshots representing con- 
tent stored on a number of individual systems. The con- 
tent metadata includes detailed information about indi- 
vidual file content, but not the content itself. For exam- 
ple, deduplicating systems will break files into a series 
of chunks with each chunk represented by a strong hash, 
sometimes referred to as a fingerprint. We collect the 
lists of chunk fingerprints and chunk sizes that represent 
each file as well as the physical layout of these chunks on 
disk. These collections represent almost 700TB of data 
and span various data types including databases, emails, 
workstation data, source code, and corporate application 
data. These allow us to analyze the stream or file-wise 
behavior of backup workloads. This type of information 
is particularly helpful in analyzing the effectiveness of 
deduplication parameters and caching algorithms. 

This study confirms and highlights the different 
requirements between backup and primary storage. 
Whereas primary storage capacities have grown rapidly 
(the total amount of digital data more than doubles ev- 
ery two years [13]), write throughput requirements have 
not needed to scale as quickly because only a small per- 
centage of the storage capacity is written every week and 
most of the bytes are longer lived. Contrast this with 
the throughput requirements of backup systems which, 
for weekly full backups, must ingest the entire primary 
capacity every week. The implication is that backup 
filesystems have had to scale their throughput to meet 
storage growth. Meeting these demands is a real chal- 
lenge, and this analysis sheds light on how deduplication 
and efficient caching can help meet that demand. 

To summarize our contributions, this paper: 


• analyzes more than 10,000 production backup systems
  and reports distributions of key metrics such
  as deduplication, contents, and rate of change;

• extensively compares backup storage systems to a
  similar study of primary storage systems; and

• uses a novel technique for extrapolating deduplication
  rates across a range of possible sizes.


The remainder of this paper is organized into the following
sections: §2 background and related work, §3
data collection and analysis techniques, §4 analysis of
broad trends across thousands of production systems,
§5 exploring design alternatives using detailed metadata
traces of production systems, and §6 conclusions and
implications for backup-specific filesystem design.


2 Background and Related Work


We divide background into three areas: backups (§2.1),
deduplication (§2.2), and data characterization (§2.3).


2.1 Backups 


Backup storage workloads are tied to the applications 
which generate them, such as EMC NetWorker or 
Symantec NetBackup. These backup software solutions 
aggregate data from online file systems and copy them to 
a backup storage device such as tape or a (deduplicating) 
disk-based storage system [7, 34]. As a result, individ- 
ual files are typically combined into large units, repre- 
senting for example all files backed up on a given night; 
these aggregates resemble UNIX “tar” files. Many other 
types of backup also exist, such as application-specific 
database backups. Backups usually run regularly, with 
the most common paradigm being weekly “full” backups 
and daily “incremental” backups. When files are modi- 
fied, the incremental backups may have large portions 
in common with earlier versions, and full backups are 
likely to have many of their comprising files completely 
unmodified, so the same data gets written to the backup 
device again and again. 


2.2 Deduplication and other Data Reduction 


In backup storage workloads the inherent high de- 
gree of data redundancy and need for high through- 
put make deduplicating techniques important. Dedu- 
plication can be performed at the granularity of en- 
tire files (e.g., Windows 2000 [5]), fixed blocks (e.g., 
Venti [29]), or variable-sized “chunks” based on content 
(e.g., LBFS [24]). In each case, a strong hash (such as 
SHA-1) of the content, i.e., its “fingerprint,” serves as
a unique identifier. Fingerprints are used to index con- 
tent already stored on the system and eliminate duplicate 
writes of the same data. Because content-defined chunks 
prevent small changes in content from resulting in unique 
chunks throughout the remainder of a file, and they are 
used in the backup appliances we have analyzed, we as- 
sume this model for the remainder of this paper. Backup 
data can be divided into content-defined chunks on the 
backup storage server, on the backup software intermedi- 
ary (e.g., a NetBackup server), or on the systems storing 
the original data. If chunked prior to transmission over a 
network, the fingerprints of the chunks can first be sent 
to the destination, where they are used to avoid transferring
those chunks already present [11, 24]. 
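
As a rough illustration of the fingerprint-based model described above, the following Python sketch splits a byte stream into content-defined chunks and stores only chunks whose SHA-1 fingerprint has not been seen. The toy rolling value, the size limits, and the names content_defined_chunks and write_backup are assumptions made for this example; they are not the chunking algorithm or interfaces of the systems studied.

```python
import hashlib

def content_defined_chunks(data, mask=0x3F, min_size=16, max_size=256):
    """Cut a chunk wherever a toy rolling value over recent bytes matches `mask`.

    The hash and the size limits are illustrative assumptions, not the
    chunker used by the appliances in the study.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # low bits depend only on recent bytes
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks

def write_backup(store, data):
    """Store only chunks with a new fingerprint; return bytes physically written."""
    written = 0
    for chunk in content_defined_chunks(data):
        fp = hashlib.sha1(chunk).hexdigest()  # strong hash used as the fingerprint
        if fp not in store:
            store[fp] = chunk
            written += len(chunk)
    return written

store = {}
night1 = bytes(range(256)) * 8
night2 = night1[:1000] + b"edited" + night1[1000:]  # small edit in the middle
print(write_backup(store, night1), write_backup(store, night2))
```

Because chunk boundaries depend on content rather than on fixed offsets, a small edit tends to disturb only the chunks near it, which is the property the text attributes to content-defined chunking.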

Traditional compression, such as gzip, complements 
data deduplication. We refer to such compression 
as “local” compression to distinguish it from com- 
pression obtained from identifying multiple copies of 
data, i.e., deduplication. The systems under study 
perform local compression after deduplication, combining
unique chunks into “compression regions” that
are compressed together to improve overall data
reduction.


There is a large body of research and commercial ef- 
forts on optimizing [14, 21, 34], scaling [8, 10], and 
improving the storage efficiency [17] of deduplicating 
backup systems. Our efforts here are mostly comple- 
mentary to that work, as we are characterizing backup 
workloads rather than designing a new storage architec- 
ture. The impact of the chunk size has been explored 
in several studies [17, 18, 22, 28], as has delta-encoding 
of content-defined chunks [18]. However, our study of 
varying chunk sizes (§5.1) uses real-world workloads
that are substantially larger than those used in previous 
studies. We also develop a novel technique for using a 
single chunk size to extrapolate deduplication at larger 
chunk sizes. This is different from the methodology of 
Kruus, et al. [17], which decides on the fly what chunk 
size to use at a particular point in a data stream, using 
the actual content of the stream. Here, we use just the 
fingerprints and sizes of chunks to form new “merged 
chunks” at a coarser granularity. We evaluate the effec- 
tiveness of this approach by comparing metrics from the 
approximated merged chunks and native chunking at dif- 
ferent sizes, then evaluate the effectiveness of chunking 
various large-scale datasets over a range of target chunk 
sizes.


2.3 Data Characterization 


The closest work to ours in topic, if not depth, is Park and 
Lilja’s backup deduplication characterization study [27]. 
It uses a small number of truncated backup traces, 25GB 
each, to evaluate metrics such as rate of change and com- 
pression ratios. By comparison, our paper considers a 
larger set of substantially larger traces from production 
environments and aims at identifying filesystem trends 
related to backup storage. 


There have been many studies of primary storage char- 
acteristics [1, 2, 3, 9, 15, 20, 22, 26, 30, 31], which have 
looked at file characteristics, access patterns and caching 
behavior for primary workloads. Our study measures 
similar characteristics but for backup workloads. It is in- 
teresting to compare the different characteristics between 
backup and primary storage systems (see §4). For
comparison data points we use the most recent study from
Microsoft [22], which contains a series of large-scale 
studies of workstation filesystems. There are some dif- 
ferences that arise from the difference in usage (backup 
versus day-to-day usage) and some that arise from the 
way the files are accessed (aggregates of many files ver- 
sus individual files). For example, the ability to dedupli- 
cate whole files may be useful for primary storage [5] but 
is not applicable to a backup environment in which one 
file is the concatenation of terabytes of individual files. 


3 Data Collection and Analysis Techniques 


In conducting a study of file-system data, the most en- 
compassing approach would be to take snapshots of all 
the systems’ data and archive them for evaluation and 
analysis. This type of exercise would permit numerous 
forms of interesting analysis including changes to sys- 
tem parameters such as average chunk size and tracking 
filesystem variations over time. 

Unfortunately, full-content snapshots are infeasible 
for several reasons, the primary one being the need to 
maintain data confidentiality and privacy. In addition, 
large datasets (hundreds of terabytes in size each) be- 
come infeasible to work with because of the long time to 
copy and process and the large capacity required to store 
them. The most practical way of conducting a large-scale 
study is to instead collect filesystem-level statistics and 
content metadata (i.e., data about the data).

For this study we collect and analyze two classes 
of data with the primary aim of characterizing backup 
workloads to help design better protection storage sys- 
tems. The first class of data is autosupport reports from 
production systems. Customers can choose to configure 
their systems to automatically generate and send auto- 
supports, which contain system monitoring and diagnos- 
tic information. For our analysis, we extract aggregate 
information from the autosupports such as file statistics, 
system capacity, total bytes stored, and others. 

The second type of information collected is detailed 
information about data contained on specific production 
systems. These collections contain chunk-level meta- 
data such as chunk hash identifiers (fingerprints), sizes, 
and location on disk. Because of the effort and storage 
needed for the second type of collection, they are ob- 
tained from only a limited set of systems. 

The two sets of data complement each other: the
autosupports (§3.1) are limited in detail but wide in
deployment, while the content metadata snapshots (§3.2)
contain great detail but are limited in deployment.


3.1 Collecting Autosupports 


The Data Domain systems that are the subject of this 
study send system data back to EMC periodically, usu- 
ally on a daily basis. These autosupport reports contain 
diagnostic and general system information that help the 
support team monitor and detect potential problems with 
deployed systems [6]. Over 10,000 of these reports are 
received per day, which makes them valuable in under- 
standing the broad characteristics of protection storage 
workloads. They include information about storage us- 
age, compression, file counts and ages, caching statistics 
and other metrics. Among other things, they can help us
understand the distribution of deduplication rates,
capacity usage, churn and file-level statistics.

For our analysis, we chose autosupports from a one-week
period. From this set, we exclude systems that fail
certain validation criteria: to be included, a system
must have been in service more than 3 months and have
more than 2.5% of its capacity used. The remaining
set consists of more than 10,000 systems with system
ages ranging from 3 months to 7 years and gives a broad 
view of the usage characteristics of backup systems. 

We consider these statistics in the aggregate; there is 
no way to subdivide the 10,000 systems by content type, 
backup software, or other similar characteristics. In ad- 
dition, we must acknowledge some possible bias in the 
results. This is a study of EMC Data Domain customers 
who voluntarily provide autosupport data (the vast ma- 
jority of them do); these customers tend to use the most 
common brands of backup software and typically have 
medium to large computing environments to protect. 


3.2 Collecting Content Metadata 


In this study, we work with deduplicated stores which en- 
able us to collect content metadata more efficiently. On 
deduplicated systems a chunk may be referenced many 
times, but the detailed information about the chunk need 
be stored just once. Figure 1 shows a schematic of a 
deduplicated store. We collect the file recipes (listing
of chunk fingerprints) for each file and then collect the
deduplicated chunk metadata from the storage containers,
as well as sub-chunk fingerprints (labeled “sub-fps”)
as described below. The file recipe and per-chunk metadata
can be later combined to create a per-file “trace”
comprised of a list of detailed chunk statistics as depicted
on the bottom right of the figure. (Note that this
“trace” is not a sequence of I/O operations but rather a
sequence of file chunk references that have been written
to a backup appliance, from oldest to newest.) Details
about the trace, including its efficient generation, are
described in §3.2.3.

In this way, the collection time and storage needed for 
the trace data is proportional to the deduplicated size. 
This can lead to almost a 10X saving for a typical backup 
storage system with 10X deduplication. In addition, 
some of the data analysis can be done on the dedupli- 
cated chunk data. This type of efficiency becomes very 
important when dealing with underlying datasets of hun- 
dreds of terabytes in size. These systems will have tens 
of billions of chunks and even the traces will be hundreds 
of gigabytes in size. 


3.2.1 Content Fields Collected 


For the content metadata snapshots, we collect the fol- 
lowing information (additional data are not discussed due 
to space limitations): 


• Per-chunk information such as size, type, SHA-1
  hash, subchunk sizes and abbreviated hashes.

• Per-file information such as file sizes, modification
  times, and fingerprints of each chunk in the file.

• Disk layout information such as location and grouping
  of chunks on disk.

Figure 1: Diagram of Data Collection Process


One of the main goals for these collections was to look 
at throughput and compression characteristics with dif- 
ferent system configurations. The systems studied were 
already chunked at 8KB on average with the correspond- 
ing SHA-1 hash values available. We chose to sub-chunk 
each 8KB chunk to, on average, 1KB and collected ab- 
breviated SHA-1 hashes for each 1KB sub-chunk. Sub- 
chunking allowed us to investigate deduplication rates at 
various chunk sizes smaller than the default 8KB, as
described in §5.1.


3.2.2 Creating Traces from Metadata Snapshots 


The collected content metadata can be used to create per- 
file traces of chunk references. These traces are the or- 
dered list of chunk metadata that comprise a file. For 
example, the simplest file trace would contain a file- 
ordered list of the chunk fingerprints and sizes that com- 
prise the file. More detailed traces might also include 
other per-chunk information such as disk location. These 
file traces can be run through a simulator or analyzed in 
other ways for metrics such as deduplication or caching 
efficiency. 

The per-file traces can be concatenated together, for 
example by file modification time (mtime), to create a 
representative trace for the entire dataset. This can be 
used to simulate reading or writing all or part of the system
contents; our analyses in §5 are based on such traces.
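
A minimal sketch of this trace construction follows, assuming each file recipe is an ordered list of fingerprints and that per-chunk metadata is available in a lookup table. The ChunkMeta record and its container field are hypothetical stand-ins for the collected per-chunk fields.

```python
from dataclasses import dataclass

@dataclass
class ChunkMeta:
    fingerprint: str
    size: int
    container: int  # hypothetical disk-location field

def file_trace(recipe, chunk_index):
    """recipe: ordered fingerprints of one file; chunk_index: fingerprint -> ChunkMeta."""
    return [chunk_index[fp] for fp in recipe]

def dataset_trace(files, chunk_index):
    """files: iterable of (mtime, recipe). Concatenate per-file traces, oldest file first."""
    trace = []
    for _mtime, recipe in sorted(files, key=lambda f: f[0]):
        trace.extend(file_trace(recipe, chunk_index))
    return trace
```

Ordering by mtime is the example given in the text; other orderings could be substituted by changing the sort key.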

For example, to simulate a write workload onto a new 
system, we could examine the sequence of fingerprints in 
order and pack new (non-duplicate) chunks together into
storage containers. The storage containers could then be
used as the unit of caching for later fingerprints in the 
sequence [34]. This layout represents a pristine storage 
system, but in reality, chunk locality is often fragmented 
because garbage collection of deleted files causes live 
chunks from different containers to be merged into new 
containers. Instead of using the pristine layout, we could 
use the container layout of chunks as provided by the 
metadata snapshot from the production system, which 
gives a more realistic caching analysis. 
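
The pristine layout described above could be sketched as follows. The container capacity is an assumed parameter (the text does not state one), and pack_into_containers is a name chosen for this example.

```python
def pack_into_containers(trace, container_bytes=4 * 1024 * 1024):
    """trace: iterable of (fingerprint, size) in write order.

    Returns {fingerprint: container_id} for a pristine layout; a duplicate
    keeps the location of its first instance. The 4 MiB default capacity is
    an assumption for illustration only.
    """
    location = {}
    container_id, used = 0, 0
    for fp, size in trace:
        if fp in location:
            continue  # duplicate: nothing new is written
        if used and used + size > container_bytes:
            container_id += 1  # current container is full, open the next one
            used = 0
        location[fp] = container_id
        used += size
    return location
```

To model the fragmented layout of a production system instead, the returned mapping would simply be replaced by the container locations recorded in the metadata snapshot.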

To simulate a read workload, we would examine 
the sequence of fingerprints in order and measure 
cache efficiency by analyzing how many container or 
compression-region loads are required to satisfy the read. 
Compression-regions are the minimal unit of read, since 
the group of chunks have to be uncompressed together, 
but reading whole containers may improve efficiency. 
While reading the entire dataset trace would be a com- 
plete restore of all backups, perhaps more realistically, 
only the most recent full backup should be read to simu- 
late a restore. 
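
A read simulation of this kind might look like the sketch below, which replays fingerprints in order and counts container loads. The LRU replacement policy and the use of whole containers as the cache unit are assumptions for this example; the same loop could count compression-region loads by mapping fingerprints to regions instead.

```python
from collections import OrderedDict

def count_container_loads(trace, location, cache_containers):
    """Replay fingerprints in order; count container loads under an LRU cache.

    trace: iterable of fingerprints; location: {fingerprint: container_id};
    cache_containers: number of containers the cache can hold.
    """
    cache = OrderedDict()
    loads = 0
    for fp in trace:
        cid = location[fp]
        if cid in cache:
            cache.move_to_end(cid)  # hit: mark as most recently used
        else:
            loads += 1  # miss: the container must be read from disk
            cache[cid] = True
            if len(cache) > cache_containers:
                cache.popitem(last=False)  # evict the least recently used
    return loads
```

Fewer loads for the same trace indicate better chunk locality or a more effective cache size.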

To approximate the read or write of one full backup 
of the trace requires knowledge of what files correspond 
to a backup. Since we don’t have the backup file cat- 
alog, we are not able to determine a full backup at file 
granularity. Instead we divide the trace into a number 
of equal sized intervals, with the interval size based on 
the deduplication rate. For instance, if the deduplication 
rate is 10X then we estimate that there are about 10 full
backups on the system, i.e., the original plus 9 identical
copies. In this example we would break the trace into 
10 intervals approximating about one backup per inter- 
val. This is an approximation: in practice, the subse- 
quent backups after the first will not be identical but will 
have some data change. But this is a reasonable approach 
for breaking the caching analysis into intervals, which al- 
lows for warming the cache and working on an estimated 
most-recent backup copy. 
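
One way to express this interval approximation is sketched below. It assumes that "equal sized" means equal logical bytes and that the number of intervals is the deduplication rate rounded to the nearest integer; both are interpretations made for this example.

```python
def split_into_backup_intervals(trace, dedup_ratio):
    """Split a time-ordered trace of (fingerprint, size) records into
    round(dedup_ratio) intervals of roughly equal logical size."""
    n = max(1, round(dedup_ratio))
    total = sum(size for _, size in trace)
    target = total / n  # logical bytes per estimated full backup
    intervals, current, acc = [], [], 0
    for fp, size in trace:
        current.append((fp, size))
        acc += size
        if acc >= target * (len(intervals) + 1) and len(intervals) < n - 1:
            intervals.append(current)
            current = []
    intervals.append(current)  # the last interval takes whatever remains
    return intervals
```

The earlier intervals can then be used to warm a simulated cache, with measurements taken on the final interval as the estimated most recent backup.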


3.2.3 Efficient Analysis of Filesystem Metadata 


The file trace data collected could be quite large, 
sometimes more than a terabyte in size, and analyzing 
these large collections efficiently is a challenge. Often 
the most efficient way to process the information is by 
use of out-of-core sorting. For instance, to calculate 
deduplication ratios we sort by fingerprint so that 
repeated chunks are adjacent, which then allows a single 
scan to calculate the unique chunk count. As another 
example, to calculate caching effectiveness we need to 
associate fingerprints with their location on disk. We 
first sort by fingerprint and assign the disk location of 
the first instance to all duplicates, then re-sort by file 
mtime and offset to have a time-ordered trace of chunks, 
with container locations, to evaluate. 
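
The sort-then-scan calculation of the deduplication ratio can be sketched as follows. An in-memory sorted() call stands in for the out-of-core sort, so this version only illustrates the logic and would not scale to terabyte-sized traces.

```python
from itertools import groupby

def dedup_ratio(trace):
    """trace: iterable of (fingerprint, size) records.

    Sort so that repeated fingerprints are adjacent, then make one pass:
    every record counts toward logical bytes, and only the first record of
    each fingerprint counts toward unique (stored) bytes.
    """
    logical = unique = 0
    for _fp, group in groupby(sorted(trace), key=lambda rec: rec[0]):
        sizes = [size for _, size in group]
        logical += sum(sizes)
        unique += sizes[0]
    return logical / unique if unique else 0.0
```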


Even the process of merging file recipes with their 
associated chunk metadata to create a file trace would 
be prohibitively slow without sorting. We initially
implemented this merge in a streaming fashion, looking
up chunk locations and pre-fetching neighboring chunks 
into a cache, much as an actual deduplication system 
would handle a read. But the process was slow because 
of the index lookups and random seeks on an engineering 
workstation with a single disk. Eventually we switched 
this process to also use out-of-core sorting. We use a 
four-step process of (1) sorting the file recipes by finger- 
print, (2) sorting the chunk metadata collection by finger- 
print, (3) merging the two sets of records, and (4) sorting 
the final record list by logical position within the file. 
This generates a sequence of chunks ordered by position 
within the file, including all associated metadata. 
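
The four steps can be sketched as a sort-merge join. The record layouts below are hypothetical, and in-memory sorts stand in for the out-of-core sorts used on the real collections.

```python
def build_file_trace(recipe_records, chunk_records):
    """recipe_records: (file_id, position, fingerprint) tuples.
    chunk_records: (fingerprint, size, container) tuples, one per unique chunk.
    Field layouts are assumptions made for this sketch.
    """
    recipes = sorted(recipe_records, key=lambda r: r[2])  # (1) recipes by fingerprint
    chunks = sorted(chunk_records, key=lambda c: c[0])    # (2) metadata by fingerprint
    merged, j = [], 0
    for file_id, position, fp in recipes:                 # (3) merge the two lists
        while j < len(chunks) and chunks[j][0] < fp:
            j += 1
        if j == len(chunks) or chunks[j][0] != fp:
            raise KeyError(f"no chunk metadata for fingerprint {fp}")
        _, size, container = chunks[j]
        merged.append((file_id, position, fp, size, container))
    merged.sort(key=lambda r: (r[0], r[1]))               # (4) by position within file
    return merged
```

Because both inputs are sorted on the join key, the merge in step (3) reads each list sequentially, which avoids the index lookups and random seeks that slowed the streaming approach.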


4 Trends Across Backup Storage Systems


We have analyzed the autosupport information from 
more than 10,000 production deduplicated filesystems, 
taken from an arbitrary week, July 24, 2011. We com- 
pare these results with published primary storage work- 
loads from Microsoft Corp. [22]. The authors of that 
study shared their data with us, which allows us to graph 
their primary workload results alongside our backup stor- 
age results. The Microsoft study looked at workstation 
filesystem characteristics for several different time peri- 
ods; we compare to their latest, a one month period in 
2009 which aggregates across 857 workstations. 

Backup storage file characteristics are significantly 
different from the Microsoft primary workload. Data- 
protection systems have generally larger, fewer and 
shorter lived files. This is an indication of more churn 
within the system but also implies more data sequential- 
ity. The following subsections detail some of these dif- 
ferences. In general, figures present both a histogram 
(probability distribution) and a cumulative distribution 
function (CDF), and when counts are presented they are 
grouped into buckets representing ranges, on a log scale, 
with labels centered under representative buckets. 


4.1 File Size 


A distinguishing characteristic between primary and 
backup workloads is file size. Figure 2 shows the file size 
distribution, weighted by bytes contained in the files, for 
both primary and backup filesystems. For backup this 
size distribution is about 3 orders of magnitude larger 
than for primary files. This is almost certainly the result 
of backup software combining individual files together 
from the primary storage system into “tar-like” collec- 
tions. Larger files reduce the likelihood of whole-file 
deduplication but increase the stream locality within the 
system. Notice that for backup files a large percentage
of the space is used by files hundreds of gigabytes in
size. Small file optimizations that make sense for primary
storage such as embedding data in inodes or use
of disk block fragments may not make sense for backup
filesystems where large allocation units can provide more
efficient metadata use.

Figure 2: File size


4.2 File and Directory Count 


File and directory counts are typically much lower in 
backup workloads. Similar to the effect of large file 
sizes, having a low file count (Figure 3(a)) results from 
having larger tar-type concatenations of protected files. 
The low directory count (Figure 3(b)) is a result of 
backup applications using catalogs to locate files. This is 
different from typical user-organized filesystems where 
a directory hierarchy is used to help order and find 
data. Looking at the ratio of file to directory count (Fig- 
ure 3(c)), we can see again that backup workloads tend 
to use a relatively flat hierarchy with several orders of 
magnitude more files per directory. 


4.3 File Age 


Figure 4 shows the distribution of file ages weighted 
by their size. For backup workloads the median age 
is about 3 weeks. This would correspond to about 1/2 
the retention period, implying data retention of about 6 
weeks. Short retention periods lead to higher data churn, 
as seen next. 


4.4 Filesystem Churn 


Filesystem churn is a measure of the percentage of 
storage that is freed and then written per time period, 
for instance in a week. Figure 5 shows a histogram of 
the weekly churn occurring across the studied backup 
systems. 

On average about 21% of the total stored data is freed 
and written per week. This high churn rate is driven by 
backup retention periods. If a backup system has a 10-week
retention policy, about 10% of the data needs to be
written and deleted every week. The median churn rate is
about 17%, corresponding to almost a 6-week retention
period, which correlates well with the median byte age
of about 3 weeks.

Figure 3: File and directory counts
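
The relationship between retention, churn, and median age used in this section can be checked with a few lines of arithmetic. The sketch assumes a steady state in which a fixed retention window is written uniformly over time, so weekly churn is the reciprocal of the retention period and the median byte age is half of it; the function names are chosen for this example.

```python
def weekly_churn_fraction(retention_weeks):
    """Fraction of stored data written and deleted per week in steady state."""
    return 1.0 / retention_weeks

def implied_retention_weeks(weekly_churn):
    """Retention period implied by an observed weekly churn fraction."""
    return 1.0 / weekly_churn

def expected_median_age_weeks(retention_weeks):
    """Median byte age if data is written uniformly over the retention window."""
    return retention_weeks / 2

print(weekly_churn_fraction(10))        # 10-week retention
print(implied_retention_weeks(0.17))    # median churn from the text
print(expected_median_age_weeks(6))     # retention of about 6 weeks
```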

This has implications for backup filesystems: such 
filesystems must be able not only to write but also
reclaim large amounts of space on a weekly basis. Storage
technologies with limited erase cycles, such as flash
memory, may not be a good fit for this workload without
care to avoid arbitrary overwrites from file system
cleaning [23].

To some extent, deduplicating storage systems help 
alleviate this problem because less physical data needs 
to be cleaned and written each week. The ratio of data 
written per week to total stored is similar whether those 
are calculated from pre-deduplication file size or
post-deduplicated storage size; this is expected as long as the
deduplication ratio is relatively constant over time. 

Note also that backup churn rates increase quickly 
over time. They follow the same growth rate as the underlying
primary data (i.e., doubling every two years).
To meet the high ingest rates, backup filesystems can 
leverage the high data redundancy of backup workloads. 
In-line deduplication of file streams can eliminate many 
of the disk writes and increase throughput. Doing so effectively
requires efficient caching, which is studied further
in §5.


4.5 Read vs Write Workload 


Data-protection systems are predominately write work- 
loads but do require sufficient read bandwidth in order to 
stream the full backup to tape, replicate changed data off- 
site, and provide for timely restores. Figure 6 shows the 
distribution of the ratio of bytes written vs total I/O bytes, 
excluding replication and garbage collection. About 50% 
of systems have overwhelmingly more writes than reads 
(90%+ write). Only about 20% of systems have more 
reads than writes. 

These I/O numbers underestimate read activity be- 
cause they do not include reads for replication. How- 
ever, since during replication an equal number of bytes 
are read by the source as written by the destination, the 
inclusion of these statistics might change the overall per- 
centages but not change the conclusion that writes pre- 
dominate. 

This is the opposite of systems with longer-lived bytes 
such as primary workloads, which typically have twice 
as many reads as writes [20]. The high write workloads 
are again indicative of high-churn systems where large 
percentages of the data are written every week. 


4.6 Replication 


For disaster recovery, backup data is typically replicated 
off-site to guard against site-level disasters such as fires 
or earthquakes. About 80% of the production systems 
replicate at least part of their backup data each week. 

Of the systems that replicate, Figure 7 shows the per- 
centage of bytes written in the last 7 days that are also 
replicated (either to or from the system). On average almost
100% of the data is replicated on these systems.
Notice that some systems replicate more data than was
written in this time period. This can be due to several
causes: some systems replicate to more than one destination
and some systems perform cascaded replication
(they receive replicated data and in turn replicate it to
another system).

Figure 4: File age

Figure 5: Weekly churn

The high percentage of replicated data increases the 
need for read throughput, resulting in a slightly more bal- 
anced read to write ratio than one might expect from just 
backup operations (write once, read rarely). This implies 
that while backup systems must provide excellent write 
performance, they cannot ignore the importance of read 
performance. 

In concurrent work, cache locality for delta compres- 
sion is analyzed in the context of replication, including 
information from production autosupport results [32]. 


4.7 Capacity Utilization 


Figure 8 shows the distribution of storage usage for both 
primary and backup systems. Backup systems skew toward
being more full than primary systems, with the
backup system modal (most frequent) utilization about
60-70% full. In contrast, primary systems are about
30-40% full. The gap in mean utilization may reflect
differences in the goals of the administrators of the two types
of systems: while performance and capacity are both im- 
portant in each environment, there is a greater empha- 
sis in data protection on dollar-efficient storage, while 
primary storage administrators may stress performance. 
Also, backup administrators have more flexibility in bal- 
ancing the data protection workloads across systems, as 
they can shorten retention periods or reduce the domain 
of protected data. Achieving higher utilization helps to 
optimize the cost of overall backup storage [6]. 


4.8 Deduplication Rates 


The amount of data redundancy is one of the key charac- 
teristics of filesystem workloads and can be a key driver 
of cost efficiency in today’s storage systems. Here we 
compare the deduplication rates of backup filesystem 
workloads with those of primary storage as reported by 
Meyer and Bolosky [22]. Figure 9 indicates that deduplication
rates for backup storage span a wide range across
systems and workloads with a mean of 10.9x. This is
dramatically higher than for primary workloads with a mean
of about 3x in the Microsoft workload. The main difference
is that backup workloads generally retain multiple
copies of data.

Figure 8: Fullness

Figure 9: Deduplication

Additionally, backups are usually contained within 
large tar-type archives that do not lend themselves to 
whole-file deduplication. When these larger files are sub- 
divided into chunks for deduplication, the chunk size can 
have widely varying effects on deduplication effectiveness
(see §5.1).


4.8.1 Compression 


Data Domain systems aggregate new unique chunks into 
compression regions, which are compressed as a single 
unit (approximately 128KB before compression). Since 
there is usually spatial locality between chunks that are 
written together, the compressibility of the full region is 
much greater than what might be achieved by compressing
each 8KB chunk in isolation. As a result, the
effectiveness of post-deduplication compression in a
backup workload will typically be comparable to that of
a primary workload. Figure 10 shows the local compression
we see across production backup workloads, with a mean
value of almost 2X, in line with the expected rule of
thumb [19]. But as can be seen, there is also a large
variation across systems with some data inherently more
random than others.


5 Sensitivity Analyses of Deduplicating 
Backup Systems 


Deduplication has enabled the transition from tape to 
disk-based data protection. Storing multiple protected 
copies on disk is only cost effective when efficient removal
of data redundancy is possible. In addition, deduplication
provides for higher write throughput (fewer
disk writes), which is necessary to meet the high churn
associated with backup storage (see §4.4). However, read
performance can be negatively impacted by the fragmentation
introduced by deduplication [25].


In this section we use trace-driven simulation to evaluate
the effect of chunk size on deduplication rates (§5.1)
and to evaluate alternatives for caching the fingerprints 
used for detecting duplicates (§5.2). First we describe 
the metadata collections, which are used for the sensitivity
analyses, in greater detail. Table 1 lists the data sets
collected and their properties, in decreasing order of pre- 
deduplication size. They span a wide range of sizes and 
deduplication rates. Most are straightforward “backup” 
workloads while one includes some data meant for long- 
term retention. They range from traces representing as 
little as 4-5TB of pre-deduplicated content up to 200TB.
The deduplication ratio (using the default 8KB target 
chunk size) also has a large range, from as little as 2.2 
to as much as 14.0; the data sets with the lowest dedupli- 
cation have relatively few full backups, with the extreme 
case being a mere 3 days worth of daily full backups. 


Deduplication over a prolonged period can be substan- 
tial if many backups are retained, but how much dedupli- 
cation is present over smaller windows, and how skewed 
is the stored data? These metrics are represented in the 
1-Wk Dedup. and MedAge columns. The former es- 
timates the average deduplication seen within a single 
week, which typically includes a full backup plus in- 
crementals. This is an approximation of the intra-full 
deduplication which cannot be determined directly be- 
cause the collected datasets do not provide information 
about full backup file boundaries. The median age is the 
point by which half the stored data was first written, and 
it provides a view into the retention and possible deduplication.
For instance, half of the data in homedirs had
been stored for 3.5 weeks or less. With daily full backups
stored 5 weeks we would expect a median age of 2.5
weeks, but the monthly full backups compensate and in- 
crease the median. 


5.1 Effect of Varying Chunk Size 


The systems studied use a default average chunk size of 
8KB, but smaller or larger chunks are possible. Varying 
the unit of deduplication has been explored many times 
in the past, usually by chunking the data at multiple sizes 
and comparing the deduplication achieved [18, 22, 28]; 
it is also possible to vary the deduplication unit dynam- 
ically [4, 17]. The smaller the average chunk size, the 
finer-grained the deduplication. When there are long re- 
gions of unchanged data, the smaller chunk size has lit- 
tle effect, since any chunk size will deduplicate equally 
well. When there are frequent changes, spaced closer to- 
gether than a chunk, all chunks will be different and fail 
to deduplicate. But when the changes are sporadic rel- 
ative to a given chunk size, having smaller chunks can 
help to isolate the parts that have changed from the parts 
that have not. 
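
The three regimes described above can be illustrated with a small sketch. It assumes fixed-size, aligned chunks and in-place single-byte modifications, which is a simplification of the content-defined chunking the studied systems use; the function name `unchanged_fraction` and the sample change patterns are hypothetical.

```python
def unchanged_fraction(total_bytes, changed_offsets, chunk_size):
    """Fraction of chunks that contain no modified byte.

    Simplifying assumption: chunks are fixed-size and aligned, and each
    change overwrites one byte in place (no insertions that shift data).
    """
    n_chunks = -(-total_bytes // chunk_size)  # ceiling division
    dirty = {offset // chunk_size for offset in changed_offsets}
    return 1 - len(dirty) / n_chunks


TOTAL = 1 << 20  # 1 MiB of logical data
KB = 1024

unchanged = []                        # a long region with no changes
frequent = range(0, TOTAL, 512)       # changes spaced closer than any chunk size
sporadic = range(0, TOTAL, 64 * KB)   # one change every 64KB

for name, offsets in [("unchanged", unchanged),
                      ("frequent", frequent),
                      ("sporadic", sporadic)]:
    for size in (1 * KB, 8 * KB):
        print(name, size, unchanged_fraction(TOTAL, offsets, size))
```

Under these assumptions only the sporadic pattern separates the two sizes: 1KB chunks leave 1008 of 1024 chunks untouched, whereas 8KB chunks leave 112 of 128.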


5.1.1 Metadata Overhead 


Since every chunk requires certain metadata to track its 
location, the aggregate overhead scales inversely with the 
chunk size. We assume a small fixed cost, 30 bytes, per 
physical chunk stored in the system and the same cost per 
logical chunk in a file recipe (where physical and logi- 
cal are post-deduplication and pre-deduplication, respec- 
tively). The 30 bytes represent the cost of a fingerprint, 
chunk length, and a small overhead for other metadata. 


Kruus et al. described an approach to chunking
data at multiple granularities and then selecting the
most appropriate size for a region of data based on its
deduplication rate [17]. They reported a reduction in
deduplication effectiveness by a factor of $1/(1+f)$, where
f is defined as the metadata size divided by the average
chunk size. For instance, with 8KB chunks and 30
bytes of metadata per chunk, this would reduce the
effectiveness of deduplication by 0.4%.


However, metadata increases as a function of
both post-deduplication physical chunks and
pre-deduplication logical chunks, i.e., it is a function of
the deduplication rate itself. If the metadata for the file
recipes is stored outside the deduplication system, the
formula for the overhead stated above would be correct.
If the recipes are part of the overhead, we must account
for the marginal costs of each logical chunk, not only the
post-deduplication costs. Since the raw deduplication
D is the ratio of logical to physical size (i.e., D = L/P)
while the real deduplication D' includes metadata costs
(D' = L / (P + f(P + L))), we can substitute L = DP in the latter
equation to get:

$$D' = \frac{D}{1 + f(1 + D)}.$$


Intuitively, we are discounting the deduplication by the 
amount of metadata overhead for one copy of the phys- 
ical data and D copies of the logical data. For a dedu- 
plication rate of 10X, using this formula, this overhead 
would reduce deduplication by 4% rather than 0.4%. 
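
The two overhead estimates can be checked numerically. The sketch below only evaluates the two formulas discussed above with the inputs used in the text (30 bytes of metadata, 8KB chunks, 10X raw deduplication); the function names are hypothetical.

```python
def dedup_physical_metadata_only(d, f):
    """Deduplication after charging metadata for physical chunks only: D / (1 + f)."""
    return d / (1 + f)


def dedup_with_recipes(d, f):
    """Deduplication after charging metadata for physical and logical chunks:
    D / (1 + f * (1 + D))."""
    return d / (1 + f * (1 + d))


def relative_loss(d_raw, d_real):
    """Fraction of the raw deduplication ratio lost to metadata."""
    return 1 - d_real / d_raw


f = 30 / 8192   # metadata bytes per chunk / average chunk size
d = 10.0        # raw deduplication ratio

loss_physical = relative_loss(d, dedup_physical_metadata_only(d, f))
loss_total = relative_loss(d, dedup_with_recipes(d, f))
print(f"physical-only loss: {loss_physical:.2%}")
print(f"with file recipes:  {loss_total:.2%}")
```

With these inputs the losses come out at roughly 0.4% and 3.9%, in line with the figures quoted above.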

However, as chunks get much smaller, the metadata
costs for increasing the number of chunks can dominate
the savings from a smaller chunk size. We can calculate
the breakeven point at which the net physical space using
chunk size C_1 is no greater than using twice that chunk
size (C_2, where C_2 = 2C_1). First, we note that f_1 = 2f_2
since the per-chunk overhead doubles. Then we compare
the total space (physical post-deduplication P_i plus
overhead) for both chunk sizes, using a single common
logical size L and writing f for f_2:

$$P_1 + 2f(L + P_1) \le P_2 + f(L + P_2).$$

Since D_i = L/P_i, we can solve for the necessary D_1:

$$D_1 \ge \frac{D_2(1 + 2f)}{1 + f(1 - D_2)}.$$

This inequality shows where the improvement in raw
deduplication (not counting metadata) is at least as much
as the increased metadata cost.¹ As an example, with the
30 bytes of overhead and 10X raw deduplication at 2KB
chunks, one would need to improve to 11.9X or more
raw deduplication at the 1KB chunk size to fare at least
as well.
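
The breakeven condition is easy to evaluate directly. The following is a minimal sketch of that calculation, assuming the per-chunk overhead ratio is given for the larger chunk size; the function name `breakeven_dedup` is hypothetical.

```python
def breakeven_dedup(d2, f2):
    """Raw deduplication needed at the halved chunk size to match the net
    space used at the larger chunk size.

    d2: raw deduplication ratio at the larger chunk size
    f2: metadata bytes per chunk / larger average chunk size
    Returns infinity when no improvement in raw deduplication can
    compensate for the doubled metadata.
    """
    denominator = 1 + f2 * (1 - d2)
    if denominator <= 0:
        return float("inf")
    return d2 * (1 + 2 * f2) / denominator


f2 = 30 / 2048  # 30 bytes of metadata per 2KB chunk
print(breakeven_dedup(10, f2))
print(breakeven_dedup(100, f2))
```

For the example in the text (10X at 2KB) this evaluates to roughly 11.86, which rounds to the 11.9X quoted; at very high raw deduplication the denominator turns non-positive and the function reports that no breakeven exists.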


5.1.2 Subchunking and Merging Chunks


We are able to take snapshots of fingerprints but not of 
content, so it is not possible to rechunk content at many 
sizes. While we could chunk data from a system at nu- 
merous sizes at the time the snapshot is created, that 
would require more processing and more storage than 
are feasible. Thus, to permit the analysis of pre-chunked 
data, for which we can later store the fingerprints but not 
the content, we take a novel approach. To get smaller 
chunks than the native 8KB size, during data collection 
we read in a chunk at its original size, sub-chunk it at a 
single smaller size (1KB), and store the fingerprints and 
sizes of the smaller sub-chunks along with the original 
chunk metadata. We can then analyze the dataset with 
1KB chunks, or merge 1KB chunks into larger chunks
(such as 2KB or 4KB on average). We can also merge
the original 8KB chunks into larger units (powers of two
up to 1MB). To keep the merged chunks distinct from the
native 8KB chunks or the 1KB sub-chunks, we will refer
to merged chunks as mchunks.

¹ There is also a point at which the deduplication at one size is so
high that the overhead from doubling the metadata costs would dominate
any possible improvement from better deduplication, around 67X
for our 30-byte overhead. Also, the formula applies to a single factor
of two but could be adjusted to allow for other chunk sizes.

For a given average chunk size, the system enforces 
both minimum and maximum sizes. To create an mchunk 
within those constraints, we group a minimum number 
of chunks (or sub-chunks) to reach the minimum size, 
then determine how many additional chunks to include 
in the mchunk in a content-aware fashion, similar to how 
chunks are created in the first place. For instance, to 
merge 1KB chunks into 4KB mchunks (2KB minimum 
and 6KB maximum), we would start with enough 1KB- 
average chunks to create at least a 2KB mchunk, then 
look at the fingerprints of the next N chunks, where the 
Nth chunk considered is the last chunk that, if included 
in the mchunk, would not exceed the maximum chunk 
size of 6KB. 

At this point we have a choice among a few possi- 
ble chunks at which to separate the current mchunk from 
the next one. We need a content-defined method to se- 
lect which chunk to use as the breakpoint, similar to the 
method used for forming chunks in the first place within 
a size range. Here, we select the chunk with the highest 
value fingerprint as the breakpoint. Since fingerprints 
are uniformly distributed, and the same data will pro- 
duce the same fingerprint, this technique produces con- 
sistent results (with sizes and deduplication comparable 
to chunking the original data), as we discuss next. We 
experimented with several alternative selection methods 
with similar results. 
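
The merging procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: it assumes that the chunk which first reaches the minimum size is itself a breakpoint candidate and that the selected chunk closes the current mchunk. The function name `merge_chunks` and the synthetic input are hypothetical.

```python
import hashlib


def merge_chunks(chunks, min_size, max_size):
    """Group (fingerprint, size) pairs into mchunks; returns a list of lists."""
    mchunks = []
    start = 0
    while start < len(chunks):
        # Step 1: take the minimum number of chunks that reaches min_size.
        end = start
        total = chunks[end][1]
        while total < min_size and end + 1 < len(chunks):
            end += 1
            total += chunks[end][1]

        # Step 2: look ahead while the mchunk would stay within max_size,
        # remembering the candidate with the highest fingerprint value.
        best = end
        nxt = end + 1
        while nxt < len(chunks) and total + chunks[nxt][1] <= max_size:
            total += chunks[nxt][1]
            if chunks[nxt][0] > chunks[best][0]:
                best = nxt
            nxt += 1

        # Step 3: cut after the selected chunk (assumed to close the mchunk).
        mchunks.append(chunks[start:best + 1])
        start = best + 1
    return mchunks


# Hypothetical input: 40 sub-chunks of exactly 1KB with SHA-1 fingerprints.
sub_chunks = [(hashlib.sha1(str(i).encode()).digest(), 1024) for i in range(40)]
mchunks = merge_chunks(sub_chunks, min_size=2048, max_size=6144)
sizes = [sum(size for _, size in m) for m in mchunks]
print(sizes)
```

Because the breakpoint depends only on fingerprint values inside the lookahead window, identical runs of chunks tend to be cut at the same places, which is what lets merged chunks deduplicate against one another. The final mchunk of a stream may fall below the minimum size.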


5.1.3 Evaluation


A key issue in this process is evaluating the error intro- 
duced by the constraints imposed by the merging pro- 
cess. We performed two sets of experiments on the 
sub-chunking and merging. The first was done on full- 
content datasets, to allow us to quantify the difference 
between ground truth and reconstructed metadata snap- 
shots. We used two of the datasets from an earlier dedu- 
plication study [8], approximately 5TB each, to com- 
pute the “ground truth” deduplication and average chunk 
sizes. We compare these to the deduplication rate and 
average when merging chunks. (The datasets were la- 
beled “workstations” and “email” in the previous study, 
but the overall deduplication rates are reported slightly 
differently because here we include additional overhead 
for metadata; despite the similar naming, these datasets 
should not be confused with the collected snapshots in 
Table 1.) Table 2 shows these results: the average chunk 
size from merging is consistently about 2-3% lower.
For the workstations dataset, the deduplication rate is
slightly higher, presumably due to smaller deduplication
units, while for the email dataset the deduplication rate is
somewhat lower (by 4-7%) with merging than when the
dataset is chunked at a given size. However, the numbers
are all close enough to serve as approximations to natural
chunking.

Figure 11: Deduplication obtained as a function of chunk
size, using the merging technique. Both axes are on a log
scale.

The second set of experiments, shown in Figure 11, 
was performed on a subset of the collected datasets (we 
selected a few for clarity and because the trends are so 
similar). For these we have no “ground truth” other
than statistics for the original 8KB chunks, but we report
the deduplication rates as a function of chunk size as
the size ranges from 1KB sub-chunks to 1024KB (1MB)
mchunks. The 1KB sub-chunks are used to merge into
2-4KB mchunks and the 8KB original chunks are used
for the larger ones.

Looking at both the “ground truth” datasets and the 
snapshot analyses, we see that deduplication decreases 
as the chunk size increases, a result consistent with many 
similar studies. For most of the datasets, deduplication improves
by 20-40% for each halving of the chunk size,
though there is some variability. As mentioned
in §5.1.1, there is also a significant metadata overhead 
for managing smaller chunks. In Figure 11, we see
that deduplication is consistently worse with the smallest
chunks (1KB) than with 2KB chunks, due to these
overheads: at that size the metadata overhead typically
reduces deduplication by 10-20%, and in one case nearly
a factor of two. Large chunk sizes also degrade deduplication;
in fact, the database1 dataset (not plotted) gets
no deduplication at all for large chunks. Excluding metadata
costs, the datasets in Table 2 would improve deduplication
by about 50% when going from 2KB to 1KB
average chunk size, but when these costs are included the
improvement is closer to 10%; it remains positive because, for
those datasets, the improvement in deduplication more than
compensates for the added per-chunk metadata.

Table 2: Comparison of the ground truth deduplication rates and chunk sizes, compared to merging sub-chunks (to
4KB) or chunks (above 8KB), on two full-content snapshots. The target sizes refer to the desired average chunk size.
The ground truth values for 1KB and 8KB average chunk sizes are included for completeness.


Data and analysis such as this can be useful for assess- 
ing the appropriate unit of deduplication when variable 
chunk sizes are supported [4, 17]. 


5.2 Cache Performance Analysis 


In deduplicating systems, the performance bottleneck is 
often the lookup for duplicate chunks. Systems with hun- 
dreds of terabytes of data will have tens of billions of 
chunks. With each chunk requiring about 30 bytes of 
metadata overhead, the full index will be many hundreds 
of gigabytes. On today’s systems, indexes of this size 
will not fit in memory and thus require an on-disk index, 
which has high access latency [34]. 
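
A back-of-the-envelope calculation shows the scale involved. The sketch below assumes 200TB of stored data, the 8KB average chunk size from §5.1, and binary units; the helper name `index_size` is hypothetical.

```python
def index_size(stored_bytes, avg_chunk_bytes, entry_bytes):
    """Return (number of chunks, index size in bytes) for a full chunk index."""
    n_chunks = stored_bytes // avg_chunk_bytes
    return n_chunks, n_chunks * entry_bytes


TB = 1 << 40
GB = 1 << 30

n_chunks, index_bytes = index_size(200 * TB, 8 * 1024, 30)
print(f"{n_chunks / 1e9:.1f} billion chunks, {index_bytes / GB:.0f} GB index")
```

Under these assumptions the index holds about 27 billion entries and occupies 750GB, far more than the memory of a typical backup appliance of the period.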

Effective caching techniques are necessary to allevi- 
ate this index lookup bottleneck, and indeed there have 
been numerous efforts at improving locality (e.g., Data 
Domain’s Segment-Informed Stream Locality [34], HP’s 
sparse indexing [21], and others). These studies have in- 
dicated that leveraging stream locality in backup work- 
loads can significantly improve write performance, but 
their analyses have been limited to a small number of 
workloads and a fixed cache size. Unlike previous stud- 
ies, we analyze for both read and write workloads across 
a broader range of datasets and examine the sensitivity of 
cache performance to cache sizes and the unit of caching. 


5.2.1 Caching Effectiveness for Writes 


As seen in §4, writes are a predominant workload
for backup storage. Achieving high write throughput 
requires avoiding expensive disk index lookups by 
having an effective chunk-hash cache. The simplest 
caching approach would be to use an LRU cache of 
chunk hashes. An LRU cache relies on duplicate chunks 
appearing within a data window that is smaller than the 
cache size. For backup workloads, duplicate chunks 
are typically found once per full backup, necessitating 
a cache sized as large as a full backup per client. This is 
prohibitively large. 

To improve caching efficiency, stream locality hints
can be employed [21, 34]. Files are typically grouped
in a similar order for each backup, and re-ordering of 
intra-file content is rare. The consistent stream-ordering 
of content can be leveraged to load the hashes of nearby 
chunks whenever an index lookup occurs. One method 
of doing so is to pack post-deduplicated chunks from the 
same stream together into disk regions. 

To investigate caching efficiency, we created a cache 
simulator to compare LRU versus using stream locality 
hints. The results for writing data are shown in Fig- 
ure 12(a). The LRU simulator does per-chunk caching 
and its results are reported in the figure with the dotted 
blue lines. The stream locality caching groups chunks
into 4MB regions called “containers” and its results are
reported in that figure with solid black lines. We simulate
various cache sizes from 32MB up to 1TB where
the cache only holds chunk fingerprints (not the chunk
data itself).² For these simulations, we replay starting
with the beginning of the trace to warm the cache and 
then record statistics for the final interval representing 
approximately the most recent backup. 
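
The difference between the two policies can be reproduced with a toy simulator. This is a minimal sketch under stated assumptions, not the simulator used for Figure 12: the cache capacity is counted in fingerprints, a Python dictionary stands in for the on-disk index, containers hold a fixed number of chunks, and the function names and the synthetic trace are hypothetical.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Per-chunk LRU cache of fingerprints; returns the number of hits."""
    cache = OrderedDict()
    hits = 0
    for fp in trace:
        if fp in cache:
            hits += 1
            cache.move_to_end(fp)
        else:
            cache[fp] = None
            if len(cache) > capacity:
                cache.popitem(last=False)
    return hits


def locality_hits(trace, capacity, container_chunks):
    """Container-granularity cache: a cache miss on a stored chunk loads
    every fingerprint of that chunk's container."""
    index = {}               # fingerprint -> container id (stand-in for the disk index)
    containers = [[]]        # container id -> fingerprints in write order
    cached = OrderedDict()   # cached container ids, in LRU order
    cached_fps = {}          # fingerprint -> id of the cached container holding it
    max_containers = max(1, capacity // container_chunks)
    hits = 0
    for fp in trace:
        if fp in cached_fps:
            hits += 1
            cached.move_to_end(cached_fps[fp])
            continue
        cid = index.get(fp)
        if cid is None:
            # New chunk: append it to the stream's open container.
            if len(containers[-1]) == container_chunks:
                containers.append([])
            containers[-1].append(fp)
            index[fp] = len(containers) - 1
            continue
        # Duplicate that missed the cache: one index lookup, then prefetch.
        cached[cid] = None
        cached.move_to_end(cid)
        for other in containers[cid]:
            cached_fps[other] = cid
        if len(cached) > max_containers:
            evicted, _ = cached.popitem(last=False)
            for other in containers[evicted]:
                if cached_fps.get(other) == evicted:
                    del cached_fps[other]
    return hits


# Hypothetical trace: two identical "full backups" of 10,000 chunks each.
trace = list(range(10_000)) * 2
print("LRU hits:     ", lru_hits(trace, capacity=256))
print("locality hits:", locality_hits(trace, capacity=256, container_chunks=64))
```

With a cache far smaller than one full backup, the per-chunk LRU cache has evicted every fingerprint before it recurs, while the container cache pays one index lookup per container and then hits on the remaining chunks in it.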


Note that deduplication write workloads have two 
types of compulsory misses, those when the chunk is in 
the system but not represented in the cache (duplicate 
chunks), and those for new chunks that are not in the system
(unique chunks). This graph includes both types of
compulsory misses. Because the misses for new chunks
are included, the maximum hit ratio is bounded by the
deduplication ratio for that backup: with a ratio of D, at
most a fraction 1 - 1/D of the chunks can hit.

Using locality hints reduces the necessary cache size
by up to 3 orders of magnitude. Notice that LRU does
achieve some deduplication with a relatively small cache,
i.e., 5-40% of duplicates could be identified with a 32MB
cache (dotted blue lines). These duplicates, which occur
relatively close together in the logical stream, may represent
incremental backups that write smaller regions of
changed data. However, effective caching is not typically
achieved with the LRU cache until the cache size is many
gigabytes, likely representing, at that point, a large
portion of the unique chunks in the system. In contrast,
using stream locality hints achieves good deduplication
hit rates with caches down to 32MB in size (solid black
lines across the top of the figure). Since production systems
typically handle tens to hundreds of simultaneous
write streams, each stream with its own cache, keeping
the per-stream cache size in the range of megabytes of
memory is important.

² To make the simulation tractable, we sampled 1 in 8 cache units,
then scaled the memory requirement by the sampling rate. We validated
this sampling against unsampled runs using smaller datasets. The cache
size is a multiple of the cache unit for a type; therefore, data points of
similar cache size do not align completely within Figure 12(a) and (b).
We crop the results of Figure 12(a) at 32MB to align with Figure 12(b).

Figure 12: Cache results for writing or reading the final portion of each dataset. For writes, the cache consists just of
metadata, while for reads it includes the full data as well and must be larger to have the same hit ratio. Differences in
marks represent the datasets, while differences in color represent the granularity of caching (containers, chunks, or in
the case of reads, compression regions).


5.2.2 Caching Effectiveness for Reads 


Read performance is also important in backup systems to
provide fast restores of data during disaster recovery. In
this subsection, we present a read caching analysis similar
to that of the previous subsection.

There are three main differences between the read and 
write cache analysis. The first is that read caches contain
the data whereas the write caches only need the
chunk fingerprints. The second is that reads have only
one kind of compulsory miss, those due to cache misses, 
while writes can also miss due to the first appearance 
of a chunk. The third is that in addition to analyzing 
stream locality hints at the container level (which rep- 
resents 4MB of chunks) we also analyze stream locality 
at the compression-region level, a 128KB grouping of 
chunks. 

Figure 12(b) shows the comparison of LRU with 
stream locality hints at the container and compression- 
region granularity for read streams. The effectiveness 
of using stream locality hints is even more exaggerated 
here than for write workloads. Stream locality hints still
allow cache sizes of less than 32MB for container level
caching (solid black lines), but chunk-level LRU (dotted 
blue lines) now requires up to several terabytes of cache 
(chunk data) to achieve effective hit rates. There is now 
a 4-6 order of magnitude difference in required cache 
sizes. Compression-region caching (dashed green lines) 
is as effective as container-level for 6 of the datasets, 
but 2 show significantly degraded hit ratios. These 
two datasets are from older systems which apparently 
have significant fragmentation at the compression-region 
level, which is smoothed out at the container level. 

Fragmentation has two implications on performance. 
One is that data that appear consecutively in the logical 
stream can be dispersed physically on disk, impacting 
read performance [25]. Another is that the unit of trans- 
fer may not correspond to the unit of access; e.g., one 
may read a large unit such as a container just to access 
a small number of chunks. The impact of fragmentation 
on performance is the subject of recent and ongoing work 
(e.g., SORT [33]). 


6 Conclusion 


We have conducted a large-scale study of deduplicated 
backup storage systems to discern their main character- 
istics. The study looks both broadly at autosupport data 
from over 10,000 deployed systems and in depth at con- 
tent metadata snapshots from a few representative sys- 
tems. The broad study examines filesystem characteris- 
tics such as file sizes, ages and churn rates while the de- 
tailed study focuses on deduplication and caching effec- 
tiveness. We contrast these results with those of primary 
filesystems from Microsoft [22]. 

As can be seen from §4, backup filesystems tend
to have fewer, larger and shorter-lived files. Backups
typically comprise either large repositories, such as
databases, or large concatenations of protected files (e.g.,
tarfiles). As backup systems ingest these primary data
stores on a repeating schedule they must delete and clean
an equal amount of older data to remain within capacity
limits. This high data churn, averaging 21% of total storage
per week, leads to some unique demands of backup
storage. Backup systems must sustain high write throughput and
scale as primary capacity grows. This is not a trivial task
as primary capacity scales with Kryder’s law (about 100x 
per decade) but disk, network, and interconnect through- 
put have not scaled nearly as quickly [13]. To keep up 
with such workloads requires data reduction techniques, 
with deduplication being an important component of any 
data protection system. Additional techniques for reduc- 
ing the ingest to a backup system, such as change-block 
tracking, are also important as systems scale further. 

Backup workloads have two properties that help meet 
these challenging throughput demands. One is that the 
data is highly redundant between full backups. The other
is that the data exhibits a lot of stream locality; that is,
neighboring chunks of data tend to remain nearby across
backups [34]. As seen in §5.2, leveraging these two
qualities allows for very efficient caching, with deduplication
hit rates of about 90% (including compulsory
misses from new chunks).

Another interesting point is that backup storage work- 
loads typically have higher demands for writing than 
reading. Primary storage workloads, which have less 
churn and longer-lived data, are skewed to relatively 
more read than write workload (2:1 as a typical metric
[20]). However, backup storage must also be able to
efficiently support read workloads, both to process
restores when needed and to replicate data off-site
for disaster recovery. Optimizing for reads requires a
more sequential disk layout and can be at odds with high 
deduplication rates, but effective backup systems must 
balance between both demands, which is an interesting 
area of future work. 

The shift from tape-based backup to disk-based, 
purpose-built backup appliances has been swift and con- 
tinues at a rate of almost 80% annually. By 2015 it is 
projected that disk-based deduplicating appliances will 
protect over 8EB of data [16]. Scaling write through- 
put at the same rate as data is growing, optimizing data 
layout, and providing efficiencies in capacity usage are 
challenging and exciting problems. The workload char- 
acterizations presented in this paper are a first step at bet- 
ter understanding a vital, unique, and under-served area 
in file systems research and we hope that it will stimulate 
further exploration.


Abstract


Replicating data off-site is critical for disaster recov- 
ery reasons, but the current approach of transferring 
tapes is cumbersome and error-prone. Replicating across 
a wide area network (WAN) is a promising alternative, 
but fast network connections are expensive or impracti- 
cal in many remote locations, so improved compression 
is needed to make WAN replication truly practical. We 
present a new technique for replicating backup datasets 
across a WAN that not only eliminates duplicate regions 
of files (deduplication) but also compresses similar re- 
gions of files with delta compression, which is available 
as a feature of EMC Data Domain systems. 

Our main contribution is an architecture that adds 
stream-informed delta compression to already existing 
deduplication systems and eliminates the need for new, 
persistent indexes. Unlike techniques based on know- 
ing a file’s version or that use a memory cache, our ap- 
proach achieves delta compression across all data repli- 
cated to a server at any time in the past. From a de- 
tailed analysis of datasets and hundreds of customers us- 
ing our product, we achieve an additional 2X compres- 
sion from delta compression beyond deduplication and 
local compression, which enables customers to replicate 
data that would otherwise fail to complete within their 
backup window. 


1 Introduction 


Creating regular backups is a common practice to pro- 
tect against hardware failures and user error. To protect 
against site disasters though, replicating backups to a re- 
mote repository is necessary. Shipping tapes has been 
a common practice but has the disadvantages of being 
cumbersome, open to security breaches, and difficult to 
verify success. Replicating across the WAN is a promising
alternative, but high-speed network connectivity is
expensive and has been reserved mainly for Tier 1 primary
data, leaving it unavailable for backup replication.


Moreover, WAN bandwidth has not increased with 
data growth rates. While we tend to think of important 
data residing in corporate centers or data warehouses, 
computation has become pervasive and valuable data is 
increasingly generated in remote locations such as ships, 
oil platforms, mining sites, or small branch offices. Net- 
work connectivity may either be expensive or only avail- 
able at low bandwidths. 


Since network bandwidth across the WAN is often 
a limiting factor, compressing data before transfer im- 
proves effective throughput. More data can be protected 
within a backup window, or, for the same reasons, data 
is protected against disasters more quickly. Numerous 
systems have explored data reduction techniques during 
network transfer including deduplication [14, 25, 35, 37], 
which is effective at replacing identical data regions 
with references. A promising technique to achieve ad- 
ditional compression is delta compression, which com- 
presses relative to similar regions by calculating the dif- 
ferences [17, 19, 36]. 


For both deduplication and delta compression, the goal 
is to find previous data that is either a duplicate or sim- 
ilar to data being transferred. We would like the pool 
of eligible data to include previous versions, maximiz- 
ing our potential compression gains. A standard ap- 
proach is to use a full index across the entire dataset, 
which requires space on disk, disk I/O, and ongoing up- 
dates [1, 19]. An alternative is to use a partial index 
holding data that has recently been transferred, which 
removes the persistent structures but shrinks the pool 
of eligible data [35]. Depending on the backup cycle, 
a week’s worth of data or more may have to reside in 
an index to achieve much compression. We present a 
novel technique called Stream-Informed Delta Compres- 
sion that achieves identity and delta compression across 
petabyte backup datasets with no prior knowledge of file 
versions while also reducing the index overheads of supporting
both compression techniques.

Repeated patterns in backup datasets have been leveraged
to design effective caching strategies to minimize
disk accesses for deduplication [2, 16, 20, 23, 39, 41]. 
Their key observation is that for backup workloads, cur- 
rent data streams tend to have patterns that correspond 
to an earlier stream, which can be leveraged for effec- 
tive caching. Our investigations show that the same data 
patterns exist for identifying similar data as well as du- 
plicates, without additional index structures. 

Our technique assumes that backup data is stored in a 
deduplicated format on both the backup server and re- 
mote backup repository. As streams of data are writ- 
ten to the backup server, they are divided into content- 
defined chunks, a secure fingerprint is calculated over 
each chunk, and only non-duplicate chunks are stored in 
containers devoted to that particular stream. 

We augment this standard technique by calculating a 
sketch of each non-duplicate chunk. Sketches, some- 
times referred to as resemblance hashes, are weak hashes 
of the chunk data with the property that if two chunks 
have the same sketch they are likely near-duplicates. 
These can be used during replication to identify simi- 
lar (non-identical) chunks. Instead of using a full index 
mapping sketches to chunks, we rely on the deduplica- 
tion system to load a cache with sketches from a previ- 
ous stream, which we demonstrate in Section 6 leads to 
compression close to using a full sketch index. During 
replication, chunks are deduplicated, and non-duplicate 
chunks are delta compressed relative to similar chunks 
that already reside at the remote repository. We then 
apply GZ [15] compression to the remaining bytes and 
transfer across the WAN to the repository where delta 
compressed data is first decoded and then stored. 

There are several important properties of Stream- 
Informed Delta Compression. First, we are able to 
achieve delta compression against any data previously 
stored and are not limited to a single identified file or the 
size constraints of a partial index. Since delta compres- 
sion relies upon a deduplication system to load a cache, 
there is a danger of missing potential compression, but 
our experiments demonstrate the loss is small and is a 
reasonable trade-off. 

Second, our architecture only requires one index of 
fingerprints, while traditional similarity detection re- 
quired one or more on-disk indexes for sketches [1, 19] 
or used a partial index with a decrease in compression. 
Another important consideration in minimizing the num- 
ber of indexes is that updating the index during file dele- 
tion is a complicated step, and reducing complexity/error 
cases is important for production systems.

Our delta compression algorithm has been released 
commercially as a standard feature for WAN replication 
between Data Domain systems. Customers have the option
of turning on delta compression when replicating
between their deduplicated backup storage systems to
achieve higher compression and correspondingly higher 
effective throughput. Analyzing statistics from hundreds 
of customers in the field shows that delta compression 
adds an additional 2X compression and enables the repli- 
cation of more data across the WAN than could otherwise 
be protected. 


2 Similarity Index Options 


To achieve the highest possible compression during 
WAN replication, we would like to find similarity 
matches across the largest possible pool of chunks. 
While previous projects have delta encoded data for 
replication, the issue of indexing sketches efficiently has 
not been explored. In this section, we discuss tradeoffs 
for three indexing options. 


2.1 Full Sketch Index 


The conceptually simplest solution is to use a full in- 
dex mapping from sketch to chunk. Unfortunately, for 
terabytes or petabytes of storage, the index is too large 
for memory and must be kept on disk, though sev- 
eral previous projects have used a full index for storing 
sketches [1, 18, 19, 40]. As an example, for a produc- 
tion deduplicated storage system with 256 TB of capac- 
ity, 8 KB average chunk size, and 16 bytes per record, 
the sketch index would be a half-TB. Sketches are ran- 
dom values so there is little locality in an index system, 
and every query will cause a disk access. 
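
The half-terabyte figure follows directly from the stated parameters, assuming binary units (1 TB = 2^40 bytes, 8 KB = 2^13 bytes):

$$\frac{256\ \text{TB}}{8\ \text{KB}} = \frac{2^{48}}{2^{13}} = 2^{35}\ \text{chunks}, \qquad 2^{35} \times 16\ \text{B} = 2^{39}\ \text{B} = 0.5\ \text{TB}.$$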

Also, a common technique is for sketches to actually 
consist of subunits called super-features that are indexed 
independently [4, 19]. Using multiple super-features in- 
creases the probability of finding a similar chunk (see 
Section 4.1), but it also requires a disk access for each 
super-feature’s on-disk index, followed by a disk access 
for the base chunk itself. Unless the number of disk 
spindles increases, lookups will be slowed by disk ac- 
cesses. Another detail that is often neglected is that each 
index has to be updated as chunks are written and deleted 
from the system, which can be complicated in a live sys- 
tem. Moving the index to flash memory decreases lookup 
time [10] but increases hardware cost. 
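
To make the notion of a sketch built from super-features concrete, the following is a minimal illustration rather than the production algorithm. It assumes that CRC32 over a sliding window stands in for the rolling fingerprint, that each feature is the maximum of a linearly transformed window hash, and that consecutive features are hashed together into super-features; the constants, the window size, and the function name `sketch` are all hypothetical.

```python
import zlib

MASK = (1 << 32) - 1

# Arbitrary (multiplier, adder) pairs; each pair defines one feature.
TRANSFORMS = [
    (0x9E3779B1, 0x7F4A7C15),
    (0x85EBCA6B, 0xC2B2AE35),
    (0x27D4EB2F, 0x165667B1),
    (0xD6E8FEB9, 0x1B873593),
]


def sketch(chunk, window=32, features_per_super=2):
    """Return a tuple of super-features for a chunk of bytes."""
    maxima = [0] * len(TRANSFORMS)
    for i in range(len(chunk) - window + 1):
        h = zlib.crc32(chunk[i:i + window])
        for t, (mult, add) in enumerate(TRANSFORMS):
            value = (mult * h + add) & MASK
            if value > maxima[t]:
                maxima[t] = value
    # Hash groups of features together; a super-feature matches only if
    # every feature in its group matches.
    supers = []
    for s in range(0, len(maxima), features_per_super):
        group = maxima[s:s + features_per_super]
        packed = b"".join(v.to_bytes(4, "big") for v in group)
        supers.append(zlib.crc32(packed))
    return tuple(supers)


# Hypothetical data: a base chunk and a copy with one byte modified.
base = bytes((i * 131 + 7) % 251 for i in range(4096))
edited = bytearray(base)
edited[2000] ^= 0xFF
shared = sum(a == b for a, b in zip(sketch(base), sketch(bytes(edited))))
print("matching super-features:", shared)
```

Because each feature is a maximum over all windows, a small edit changes a feature only if it touches the window that produced the maximum or creates a larger value, so similar chunks usually share at least one super-feature. Each super-feature would need its own index, which is the source of the extra disk accesses noted above.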


2.2 Partial Sketch Index 


An alternative to a full index is to use a partial in- 
dex holding recently transmitted sketches, which would 
probably reside in memory, but could also exist on disk. 
The advantage of a partial index is that it can be cre- 
ated as data is replicated without the need for persis- 
tent data structures, and several projects [33, 35] and 
products [32] use a cache structure. Sizing and updat- 
ing a partial index are important considerations. The 
most common implementations are FIFO or LRU policies [33], which have
the advantage of finding similar chunks nearby in the replication
stream, but will miss distant matches. For backup workloads, repeated
data may not appear until next week’s full backup takes place, and
enterprise organizations typically have hundreds to thousands of
primary storage machines to be backed up within that time. Therefore,
a partial index would have to be large enough to hold all of an
organization’s primary data. Riverbed [32] uses an array of disks to
index recently transferred data.

Figure 1: Optimal compression in a backup configuration (e.g. weekly
full backup) requires an index to include at least a full backup
cycle (1.0 on the x-axis).

Another form of a partial-index is to use version infor- 
mation. As an example, rsync [37] uses file pathnames 
as the mechanism to find previous versions to perform 
compression before network transfer. 

We analyze this experimentally in Figure 1, which 
shows how much compression is achieved as index cov- 
erage increases (more details are in Section 6). The 
datasets consist of two weeks’ worth of backup data,
and the combination of deduplication and delta compres- 
sion across both weeks is presented, normalized relative 
to compression achievable with a full index (right-most 
data points). This result shows a sharp increase in com- 
pression aligned with the one week boundary when suffi- 
cient data are covered by an index for both deduplication 
and delta compression. Effectively, a partial index would 
have to be nearly as large as a full index to achieve high 
compression. 


2.3 Stream-Informed Sketch Cache 


Numerous papers have explored properties of backup 
datasets and found that there are repeated patterns related 
to backup policies. These patterns have been leveraged 
in deduplication systems to prefetch fingerprints written 
sequentially by a previous data stream [2, 16, 20, 39, 41]. 
We discovered that similarity detection has the same 
stream properties as deduplication, because small edits to 
a file will probably be a similarity match to the previous 
backup of the same file, and edits may be surrounded by 
duplicate regions that can load a cache effectively. This
exploration of similarity locality is one of the major
contributions of our work.

Following on previous work, we could build a cache 
and indexing system similar to deduplicating systems 
(i.e. Bloom filters and indexes), but a disadvantage of 
this approach is that the number of indexing structures in- 
creases with the number of super-features and adds com- 
plexity to our system. 

Instead, we leverage the same cache-loading technique 
used by our storage system for deduplication [41]. While 
loading a previous stream’s fingerprints into a cache, we 
also load sketches from the same stream. This has the 
significant advantage of removing the need for extra on- 
disk indexes that must be queried and maintained, but 
it also has the potential disadvantage of less similarity 
detection than indexing sketches directly. 

To explore these alternatives, we built a full sketch in- 
dex, a partial index, and a stream-informed cache that 
piggy-backs on deduplication infrastructure. In Section 6 
we explore trade-offs between these three techniques. 


3 Delta Replication Architecture 


While our research has focused on improving the com- 
pression and throughput of replication, it builds upon 
deduplication features of Data Domain backup storage 
systems. We first present an overview of our efficient 
caching technique before augmenting that architecture to 
support delta compression in replication. 


3.1 Stream-Informed Cache for Deduplication 


A typical deduplication storage system receives a stream 
consisting of numerous smaller files concatenated to- 
gether in a tar-like structure. The file is divided into 
content-defined chunks [22, 25], and a secure hash value 
such as SHA-1 is calculated over each chunk to repre- 
sent it as a fingerprint. The fingerprint is then compared 
against an index of fingerprints for previously stored 
chunks. If the fingerprint is new, then the chunk is stored 
and the index updated, but if the fingerprint already ex- 
ists, only a reference to the previous chunk is maintained 
in a file’s meta data. Depending on backup patterns 
and retention period, customers may experience 10X or 
higher deduplication (logical file size divided by post- 
deduplication size). 

Early deduplication storage systems ran into a fin- 
gerprint index bottleneck, because the index was too 
large to fit in memory, and index lookups limited overall 
throughput [30]. Several systems addressed this prob- 
lem by introducing caching techniques. The key insight 
of the Data Domain system [41] is that when a finger- 
print is a duplicate, the following fingerprints will likely 
match data written consecutively in an earlier stream. 
We present our basic deduplication architecture along
with highlighted modifications in Figure 2. Fingerprints
and chunks are laid out in containers and can be loaded
into a fingerprint cache. When a chunk is presented for
storage, its fingerprint is compared against the cache,
and on a miss, a Bloom filter is checked to determine
whether the fingerprint is likely to exist in an on-disk
index. If so, the index is checked, and the corresponding
container’s list of fingerprints is loaded into the cache.
When eviction occurs, based on an LRU policy, all
fingerprints from a container are evicted as a group. Other
techniques for maintaining fingerprint locality have been
presented [2, 16, 20, 23, 39], which indexed either
deduplicated chunks or the logical stream of file data.

Figure 2: Data Domain deduplication architecture with
cache, Bloom filter, fingerprint index, and containers.
Highlighted modifications show sketches stored in
containers and loaded in a stream-informed cache when
fingerprints are loaded.
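
The lookup path described in this section (cache, then Bloom filter, then on-disk index, then loading a whole container) can be sketched as follows. This is a toy model under stated assumptions, not the production implementation: the class and attribute names are hypothetical, a Python set stands in for the Bloom filter, and dictionaries stand in for the on-disk index and containers.

```python
from collections import OrderedDict

class StreamInformedCache:
    """Toy model: fingerprints are cached and evicted one container at a time."""

    def __init__(self, index, containers, max_containers):
        self.index = index              # fingerprint -> container id (on-disk index stand-in)
        self.containers = containers    # container id -> list of fingerprints
        self.bloom = set(index)         # exact set standing in for a Bloom filter
        self.max_containers = max_containers
        self.cached = OrderedDict()     # container id -> set of fingerprints, in LRU order
        self.index_lookups = 0          # counts simulated disk accesses

    def _find_cached(self, fp):
        for cid, fps in self.cached.items():
            if fp in fps:
                return cid
        return None

    def is_duplicate(self, fp):
        cid = self._find_cached(fp)
        if cid is not None:
            self.cached.move_to_end(cid)      # refresh the container's LRU position
            return True
        if fp not in self.bloom:              # definitely new: skip the index
            return False
        self.index_lookups += 1               # models one on-disk index access
        cid = self.index.get(fp)
        if cid is None:                       # would be a Bloom false positive
            return False
        self.cached[cid] = set(self.containers[cid])
        if len(self.cached) > self.max_containers:
            self.cached.popitem(last=False)   # evict the whole LRU container
        return True

containers = {0: ["a", "b", "c"], 1: ["d", "e"]}
index = {fp: cid for cid, fps in containers.items() for fp in fps}
cache = StreamInformedCache(index, containers, max_containers=1)
hits = [cache.is_duplicate(fp) for fp in ["a", "b", "c", "x", "d"]]
```

In this small run, only the first fingerprint of each container costs an index lookup; the fingerprints that follow it in the stream are served from the cache.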


3.2 Replication with Deduplication 


For disaster recovery purposes, it is important to repli- 
cate backups from a backup server to a remote repository. 
Replication is a common feature in storage systems [28], 
and techniques exist to synchronize versions of a reposi- 
tory while minimizing network transfer [18, 37]. In most 
cases, these approaches result in completely reconstruct- 
ing files at the destination. 

For deduplication storage systems, it is natural to only 
transfer the unique chunks and the meta data needed to 
reconstruct logical files. Although not described in de- 
tail, products such as Data Domain BOOST [13] already 
support deduplicated replication by querying the remote 
repository with fingerprints and only transferring unique 
chunks, which can be compressed with GZ or other local
compressors. Earlier work by Eshghi et al. [14] presented
a similar approach that minimized network transfer
by querying the remote repository with a hierarchical
file consisting of hashes of chunks. These approaches
remove duplicates in network-constrained environments.

Figure 3: Replication protocol modified to include delta
compression.


3.3 Delta Replication


We expand upon standard replication for deduplication 
systems by introducing delta compression to achieve 
higher total compression than deduplication and local 
compression can achieve. We modified the basic ar- 
chitecture in Figure 2, adding sketches to the container 
meta data section. Sketches are designed so that similar 
chunks often have identical sketches. As data is written 
to a deduplicating storage node, non-duplicate chunks 
are further processed to create a sketch, which is stored 
in the container along with the fingerprint. During du- 
plicate filtering at the repository, both fingerprints and 
sketches are loaded into a cache. In later sections, we 
explore trade-offs of this architecture decision. 


3.4 Network Protocol Considerations for Delta Compression


The main issue to address is that both source and des- 
tination must agree on and have the same base chunk, 
the source using it to encode and the destination to de- 
code. Figure 3 shows the protocol we chose for com- 
bining deduplication and delta compression. The backup 
server sends a batch of fingerprints to the remote repos- 
itory, which loads its cache, performs filtering, and re- 
sponds indicating which corresponding chunks are al- 
ready stored. For delta compression, the backup server 
then sends the sketches of unique chunks to the repos- 
itory, and the repository checks the cache for matching 
sketches. The repository responds with the fingerprint
corresponding to the similar chunk, called the base
fingerprint, or indicates that there is no similarity match. If
the backup server has the base fingerprint, it delta
compresses a chunk relative to the base before local
compression and transfer. At the repository, delta encoded
and compressed chunks are uncompressed and decoded
in preparation for storage.

We considered sending sketches with fingerprints in 
Phase 1, but sending sketches after filtering (Phase 2) re- 
duces wasted meta data overhead, compared to sending 
the sketches for all chunks. Fingerprint filtering occurs 
on the destination, and its cache is properly set up to find 
similar chunks. So in practice, it is best if the destination 
performs similarity lookup. 
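
The two-phase exchange can be illustrated with a small in-memory simulation. This is a sketch under simplifying assumptions: each chunk carries a single sketch value rather than several super-features, the repository state is modeled with plain Python containers, and the function and parameter names are hypothetical.

```python
def replicate_batch(chunks, repo_fingerprints, repo_sketches, source_store):
    """Decide how each chunk in a batch is transferred.

    chunks: list of (fingerprint, sketch, data) tuples at the backup server.
    repo_fingerprints: set of fingerprints already stored at the repository.
    repo_sketches: dict mapping sketch -> base fingerprint at the repository.
    source_store: dict of fingerprint -> data still held by the backup server.
    """
    # Phase 1: send fingerprints; the repository reports which are already stored.
    already_stored = {fp for fp, _, _ in chunks if fp in repo_fingerprints}
    unique = [c for c in chunks if c[0] not in already_stored]

    # Phase 2: send sketches for unique chunks only; the repository answers
    # with a base fingerprint, or None when there is no similarity match.
    bases = {fp: repo_sketches.get(sketch) for fp, sketch, _ in unique}

    plan = []
    for fp, _, _ in chunks:
        if fp in already_stored:
            plan.append((fp, "skip"))
        elif bases[fp] is not None and bases[fp] in source_store:
            plan.append((fp, "delta against " + bases[fp]))
        else:
            plan.append((fp, "send full"))   # no base, or base missing at the source
    return plan

plan = replicate_batch(
    chunks=[("f1", "s1", b"x"), ("f2", "s2", b"y"), ("f3", "s3", b"z")],
    repo_fingerprints={"f1", "b2", "b3"},
    repo_sketches={"s2": "b2", "s3": "b3"},
    source_store={"b2": b"base"},
)
```

Sending sketches only in the second phase means that chunk f1, which is a duplicate, never has its sketch transmitted.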


4 Implementation Details 


In this section, we discuss: creating sketches, selecting 
a similar base chunk, and delta compression relative to a 
base. 


4.1 Similarity Detection with Sketches 


In order to delta compress chunks, we must first find 
a similar chunk already replicated. Numerous previous 
projects have used sketches to find similar matches, and 
our technique is most similar to the work of Broder et 
al. [4, 5, 6]. 

Intuitively, similarity sketches work by identifying 
“features” of a chunk that would not likely change even 
as small variations are introduced in the data. One ap- 
proach is to use a rolling hash function over all overlap- 
ping small regions of data (e.g. 32 byte windows) and 
choose as the feature the maximal hash value seen. This 
can be done with multiple different hash functions gen- 
erating multiple features. Chunks that have one or more 
features (maximal values) in common are likely to be 
very similar, but small changes to the data are unlikely 
to perturb the maximal values [4]. 

Figure 4 shows an example with data chunks 1 and 2 
that are similar to each other and have four sketch fea- 
tures (maximal values) in common. They have the same 
maximal values because the 32-byte windows that gener- 
ated the maximal values were not modified by the added 
regions (in red). If different regions had changed it could 
affect one or more of the maximal values, so different 
maximal features would be selected to represent chunk 
2. This would cause a feature match to fail. In general, 
as long as some set of the maximal values are unchanged, 
a similarity match will be possible. 

For our sketches we group multiple features together 
to form “super-features” (also called super-fingerprints 
in [19]). The super-feature value is a strong hash of the 
underlying feature values. If two chunks have an identi- 
cal super-feature then all the underlying features match. 
Using super-features helps reduce false positives and re- 
quires chunks to be more similar for a match to be found. 


Figure 4: Similar chunks tend to have the same maximal
values, which can be used to create features for a sketch.


To generate multiple, independent features, we first
generate a Rabin fingerprint Rabin_fp over rolling windows
w of chunk C and compare the fingerprint against a
mask for sampling purposes. We then permute the Rabin
fingerprint to generate multiple values with functions $\pi_i$,
using randomly generated coprime multiplier and adder
values $m_i$ and $a_i$.


$$fp = \mathrm{Rabin\_fp}(w)$$
$$\pi_i(fp) = (m_i \cdot fp + a_i) \bmod 2^{32}$$


If the result of $\pi_i(fp)$ is maximal for all w, then we
retain the Rabin fingerprint as $feature_i$. After calculating
all features, a super-feature $sf_j$ is formed by taking a
Rabin fingerprint over k consecutive features. We represent
consecutive features as $feature_{b \ldots e}$ for beginning and
ending positions b and e, respectively.


$$sf_j = \mathrm{Rabin\_fp}(feature_{j \cdot k \ldots j \cdot k + k - 1})$$


As an example, to produce three super-features with 
k = 4 features each, we generate twelve features, and 
calculate super-features over the features 0...3, 4...7, and 
8...11. 
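
The feature and super-feature construction can be sketched in a few lines. This is an illustration only, under several assumptions: `zlib.crc32` stands in for the Rabin fingerprint (and is recomputed per window rather than rolled), the sampling mask is omitted, the modulus 2**32 and the odd multipliers are choices made for this sketch, and all function names are hypothetical.

```python
import random
import zlib

WINDOW = 32      # bytes per window, as in the 32-byte example in the text
MOD = 2**32      # assumed modulus for the permutation functions

def make_permutations(n, seed=0):
    """Random (multiplier, adder) pairs; odd multipliers are coprime to 2**32."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MOD, 2), rng.randrange(MOD)) for _ in range(n)]

def features(chunk, perms):
    """For each permutation, keep the window hash whose permuted value is maximal."""
    best = [(-1, 0)] * len(perms)      # (permuted value, window hash) per feature
    for start in range(len(chunk) - WINDOW + 1):
        fp = zlib.crc32(chunk[start:start + WINDOW])   # stand-in for Rabin_fp(w)
        for i, (m, a) in enumerate(perms):
            value = (m * fp + a) % MOD
            if value > best[i][0]:
                best[i] = (value, fp)
    return [fp for _, fp in best]

def super_features(chunk, num_sf=3, k=4, seed=0):
    """Group k consecutive features and hash each group into one super-feature."""
    feats = features(chunk, make_permutations(num_sf * k, seed))
    sfs = []
    for j in range(num_sf):
        group = feats[j * k:(j + 1) * k]           # features j*k ... j*k+k-1
        packed = b"".join(f.to_bytes(4, "big") for f in group)
        sfs.append(zlib.crc32(packed))             # stand-in for Rabin_fp over the group
    return sfs

sketch = super_features(bytes(range(256)) * 8)
```

Both sides must use the same permutation parameters (here, the same seed), otherwise identical chunks would not produce identical sketches.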

We performed a large number of experiments varying 
the number of features per super-feature and number of 
super-features per sketch. Increasing the number of fea- 
tures per super-feature increases the quality of matches, 
but also decreases the number of matches found. In- 
creasing the number of super-features increases the num- 
ber of matches but with increased indexing requirements. 
We typically found good similarity matches with four 
features per super-feature and a small number of super- 
features per sketch. These early experiments were com- 
pleted with datasets that consisted of multiple weeks of 
backups and had sizes varying from hundreds of giga- 
bytes to several terabytes. We explore the delta com- 
pression benefits of using more than one super-feature in 
Section 6.4. 

To perform a similarity lookup, we use each super-feature
as a query to an index representing the corresponding
super-features of previously processed chunks.
Chunks that match on more super-features are considered
better matches than those that match on fewer super-features,
and experiments show a correlation between the
number of super-feature matches and delta compression.
Other properties can be used when selecting among
candidates, including age, status in a cache, locality on disk,
or other criteria.
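
A minimal sketch of this candidate selection, assuming one in-memory table per super-feature position and chunk ids that increase over time. The class name is hypothetical, and breaking ties in favor of the most recently written chunk is just one of the possible criteria listed above.

```python
from collections import Counter, defaultdict

class SketchIndex:
    """One lookup table per super-feature position (illustrative only)."""

    def __init__(self, num_sf):
        # Each table maps a super-feature value to the chunk ids that carry it.
        self.tables = [defaultdict(list) for _ in range(num_sf)]

    def add(self, chunk_id, sketch):
        for table, sf in zip(self.tables, sketch):
            table[sf].append(chunk_id)

    def best_match(self, sketch):
        """Return the chunk id sharing the most super-features, or None."""
        votes = Counter()
        for table, sf in zip(self.tables, sketch):
            for chunk_id in table.get(sf, ()):
                votes[chunk_id] += 1
        if not votes:
            return None
        # Ties go to the larger id, assumed here to be the most recently written.
        return max(votes, key=lambda cid: (votes[cid], cid))

idx = SketchIndex(num_sf=3)
idx.add(1, (10, 20, 30))
idx.add(2, (10, 21, 30))
best = idx.best_match((10, 21, 30))   # chunk 2 matches on all three super-features
```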


4.2 Delta Compression 


Once a candidate chunk has been selected, it is referred
to as the base used for delta compression, and the target
chunk currently being processed will be represented
as a 1-level delta of the base. To perform delta encoding,
we use a technique based upon Xdelta [21], which is
optimized for compressing highly similar data regions.
We initialize the encoding by iterating through the 
base chunk, calculating a hash value at subsampled po- 
sitions, and storing the hash and offset in a temporary 
index. We then begin processing the target chunk by cal- 
culating a hash value at rolling window positions. We 
look up the hash value in the index to find a match against 
the base chunk. If there is a match, we compare bytes in 
the base and target chunks forward and backward from 
the starting position to create the longest match possible, 
which is encoded as a copy instruction. If the bytes fail 
to match, we issue an insert instruction to insert the 
target’s bytes into the output buffer, and we also add this 
region to the hash index. During the backward scans, 
we may intersect a region previously encoded. We han- 
dle this by determining whether keeping the previous in- 
struction or updating it will lead to greater compression. 
Since we are performing delta compression at the chunk 
level, as compared to the file level, we are able to main- 
tain this temporary index and output buffer in memory. 
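
The copy/insert encoding described above can be sketched as follows. This is a simplified illustration, not Xdelta and not the implementation used in the system: the window bytes themselves act as the hash key, the window size and stride are arbitrary, backward extension only reclaims bytes that have not yet been emitted, and inserted regions are not added to the index.

```python
WINDOW = 16   # bytes hashed per position (illustrative value)
STEP = 4      # subsampling stride over the base (illustrative value)

def delta_encode(base, target):
    """Return a list of ("copy", offset, length) and ("insert", bytes) instructions."""
    index = {}
    for off in range(0, len(base) - WINDOW + 1, STEP):
        index.setdefault(base[off:off + WINDOW], off)   # keep the first offset seen

    out, literal, pos = [], bytearray(), 0
    while pos < len(target):
        off = index.get(target[pos:pos + WINDOW])
        if off is None:
            literal.append(target[pos])      # no match here: pending insert byte
            pos += 1
            continue
        # Extend the match backward into bytes still waiting in the insert buffer.
        while literal and off > 0 and base[off - 1] == literal[-1]:
            literal.pop()
            off -= 1
            pos -= 1
        # Extend the match forward as far as base and target agree.
        length = 0
        while (off + length < len(base) and pos + length < len(target)
               and base[off + length] == target[pos + length]):
            length += 1
        if literal:
            out.append(("insert", bytes(literal)))
            literal = bytearray()
        out.append(("copy", off, length))
        pos += length
    if literal:
        out.append(("insert", bytes(literal)))
    return out

def delta_decode(base, instructions):
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for ins in instructions:
        if ins[0] == "copy":
            _, off, length = ins
            out += base[off:off + length]
        else:
            out += ins[1]
    return bytes(out)

base = bytes(range(200))
target = base[:100] + b"EDIT" + base[100:]
delta = delta_encode(base, target)
```

For a chunk with a small edit in the middle, the delta reduces to two copy instructions around one short insert, which is why highly similar chunks encode to a small fraction of their size.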


5 Experimental Details 


We perform actual replication experiments on working 
hardware with multi-month datasets whenever practical, 
but we also use simulators to compare alternative tech- 
niques. In this section, we first present the datasets 
tested, then details of our experimental setup, and finally 
compression metrics. 


5.1 Datasets 


In this paper we use backup datasets collected over sev- 
eral months as shown in Table 1, which lists the type of 
data, total size in TB, months collected, deduplication, 
delta, GZ, and total compression. Total compression is 
measured as data bytes divided by replicated bytes (after 
all types of compression) and is equivalent to the multi- 
plication of deduplication, delta, and GZ. For the com- 
pression values, we used results from our default con- 
figuration. These datasets were previously studied for 
deduplication [11, 27] but not delta compression. Note
that our deduplication results vary slightly (within 5%)
from Dong et al. [11] due to implementation differences.

We also highlight steady-state delta compression after 
a seeding period has completed. For all of the datasets 
except Email, seeding was one week, and the period af- 
ter seeding is the remaining months of data. Customers 
often handle initial seeding by keeping pairs of replicat- 
ing machines on a LAN (when new hardware is installed) 
until seeding completes and then move the destination 
machine to the long-term location. Alternatively, seed- 
ing can be handled using backups available at the des- 
tination. While there is some delta compression within 
the seeding period, delta compression increases once a 
set of base chunks become available, and the period after 
seeding is indicative of what customers will experience 
for the lifetime of their storage. 

These datasets consist of large “tar” type files repre- 
senting many user files or objects concatenated together 
by backup software. Except for Email (explained be- 
low), these datasets consist of a repeated pattern of a 
weekly full backup followed by six, smaller incremen- 
tal backups. 

Source Code Repository: Backups from a version con- 
trol repository containing source code. 

Workstations: Backups from 16 desktops used by soft- 
ware engineers. 

Email: Backups from a Microsoft Exchange server. Un- 
like the other datasets, Email consists of daily full back- 
ups, and the seeding phase consists of a single backup 
instead of a week’s worth of data. 

System Logs: Backups from a server’s /var directory,
mostly consisting of emails stored by a list server.

Home Directories: Backups from software engineers’
home directories containing source code, office documents,
etc.


5.2 Delta Replication Experiments 


Many of our experiments were performed on production 
hardware replicating between pairs of systems in our lab. 
We actually used a variety of machines that varied in stor- 
age capacity (350 GB - 5 TB), RAM (4 GB - 16 GB), 
and computational resources (2 - 8 cores). We have con- 
trolled internal parameters and confirmed that disparate 
machines produce consistent results. Unless specifically 
stated, we ran all experiments with 3 super-features per 
sketch, 12 MB sketch cache, 8 KB average chunk size, 
and 4.5 MB containers holding meta data and locally 
compressed chunks. When applying local compression, 
we create compression regions of approximately 128 KB 
of chunks. 


5.3 Simulator Experiments


We compare our technique of replication with a fingerprint
index and sketch cache against two alternative
architectures: 1) full fingerprint and sketch indexes and 2)
a partial-index of fingerprints and sketches implementing
an LRU eviction policy.

Table 1: Summary of datasets. Deduplication, delta, and GZ compression factors are shown across the entire dataset
as well as for the period after seeding, which was typically one week.

Before building the production system, we actually 
started with a simplified simulator that maintained a full 
index of fingerprints and sketches in memory. To de- 
crease memory overheads, we use 12 bytes per finger- 
print as compared to larger fingerprints necessary for a 
product such as a 20 byte SHA-1. In a separate analy- 
sis, we found that 12 byte fingerprints only cause a small 
number of collisions out of the hundreds of millions of 
chunks processed. To maximize throughput and simplify 
the code, we try to keep the entire index in RAM. Also, 
instead of implementing a full replication protocol, we 
record statistics as the client deduplicates and delta com- 
presses chunks without network transfer. Our simulator 
did not apply local compression with the same technique 
as our replication system, so comparisons to the simula- 
tor do not include local compression. 


Our second simulator explores the issues of data lo- 
cality and index requirements with an LRU partial-index 
of fingerprints and sketches. This partial-index is a mod- 
ification of the previous simulator with the addition of 
parameters to control the index size. The partial-index 
only holds meta data, fingerprints and sketches, which 
each reference chunks stored on disk. The fingerprint
and sketches for a chunk maintain the same age in the
partial-index, so they are added and evicted as a unit. If a
fingerprint is referenced as a duplicate of incoming data
or a sketch is selected as the best similarity match for
compression, the age is updated.
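
The behavior of the LRU partial-index can be sketched with an ordered dictionary. The class and method names are hypothetical, the capacity is counted in chunks rather than bytes, and for simplicity each super-feature value points to at most one chunk.

```python
from collections import OrderedDict

class PartialIndex:
    """LRU index whose entries hold a chunk's fingerprint and sketch together."""

    def __init__(self, max_chunks):
        self.max_chunks = max_chunks
        self.entries = OrderedDict()   # fingerprint -> sketch, oldest first
        self.by_sf = {}                # super-feature value -> fingerprint

    def add(self, fingerprint, sketch):
        self.entries[fingerprint] = sketch
        self.entries.move_to_end(fingerprint)
        for sf in sketch:
            self.by_sf[sf] = fingerprint
        while len(self.entries) > self.max_chunks:
            old_fp, old_sketch = self.entries.popitem(last=False)
            for sf in old_sketch:                  # the sketch leaves with its fingerprint
                if self.by_sf.get(sf) == old_fp:
                    del self.by_sf[sf]

    def lookup_duplicate(self, fingerprint):
        if fingerprint in self.entries:
            self.entries.move_to_end(fingerprint)  # a duplicate reference refreshes the age
            return True
        return False

    def lookup_similar(self, sketch):
        for sf in sketch:
            fp = self.by_sf.get(sf)
            if fp is not None:
                self.entries.move_to_end(fp)       # a similarity match refreshes the age
                return fp
        return None

p = PartialIndex(max_chunks=2)
p.add("a", (1,))
p.add("b", (2,))
p.lookup_duplicate("a")   # "a" becomes the most recently used entry
p.add("c", (3,))          # evicts "b", the least recently used entry
```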


5.4 Compression Metrics 


Our focus is on improving replication across the WAN, 
specifically for customers with low network connectiv- 
ity. For that reason, we mostly focus on compression 
metrics, though we also present throughput results from 
experiments and hundreds of customer systems. 

We tend to use the term compression generically to refer
to any type of data reduction during replication, such
as deduplication, delta compression, or local compression
with an algorithm such as GZ. Compression is calculated
as original_bytes/post_compression_bytes. However,
we generally use the term total compression to
mean data reduction achieved by deduplication, delta,
and GZ in combination. As an example, if the deduplication
factor is 10X, delta is 2X, and GZ is 1.5X, then total
compression is 30X since these techniques have a
multiplicative effect. A compression factor of 1X indicates
no data reduction. In order to show different datasets
on the same graph, we often plot normalized compression,
which is total compression of a particular experiment
divided by the maximum total compression. As
explained in Section 6, maximum compression is measured
using a full index or the appropriate baseline for
each experiment and dataset. Normalized compression
is in the range (0...1].
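
These definitions translate directly into code. A minimal sketch with hypothetical function names; the 40X baseline in the last line is an invented value used only to show normalization:

```python
def compression_factor(original_bytes, post_compression_bytes):
    """Generic compression: original size divided by size after reduction."""
    return original_bytes / post_compression_bytes

def total_compression(dedup, delta, gz):
    """The three factors multiply because each stage acts on the previous output."""
    return dedup * delta * gz

def normalized_compression(total, maximum_total):
    """Fraction of the best achievable compression, in the range (0, 1]."""
    return total / maximum_total

total = total_compression(10, 2, 1.5)           # the example from the text: 30X
ratio = normalized_compression(total, 40.0)     # hypothetical 40X baseline
```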


6 Results 


In this section, we begin by exploring parameters of our 
system (cache size, number of super-features, and multi- 
level delta) and then compare Stream-Informed Delta 
Compression to alternative techniques such as using a 
full sketch index or maintaining a partial-index of re- 
cently used sketches. We then investigate the interaction 
of delta and GZ compression. 


6.1 Sketch Cache Size 


When designing our cache-based delta system, sizing the 
cache is an important consideration. If datasets have 
similarity locality that matches up perfectly to dedupli- 
cation locality, then a cache holding a single container 
could theoretically achieve all of the possible compres- 
sion. With a larger cache, similarity matches may be 
found to chunks loaded in the recent past, with com- 
pression growing with cache size. We found that the hit 
rate is maximized with a cache sized consistently across 
datasets even though Home Directories is over twice as 
large as the other datasets. 

We evaluated the sketch cache hit rate in Figure 5, by
increasing the sketch cache size (x-axis) and measuring
the number of similarity matches found in the cache
relative to using a full index. The sketch cache size refers to
the amount of memory required to hold sketches, which
is approximately 12 bytes per super-feature. Therefore a
cache of 12 MB corresponds to 1 million super-features
and 1/3 million chunks, since we have 3 super-features
per sketch by default.

Figure 5: Locality-informed sketch cache hit rate reaches
its maximum with a cache of 12-16 MB.
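
The cache capacity quoted above is simple arithmetic. A short sketch, assuming binary megabytes and exactly 12 bytes per super-feature (the text says approximately); the function name is illustrative:

```python
MB = 2**20   # assumption: binary megabytes

def cache_capacity(cache_bytes, bytes_per_sf=12, sf_per_sketch=3):
    """Return (super-features held, chunks covered) for a given cache size."""
    super_features = cache_bytes // bytes_per_sf
    chunks = super_features // sf_per_sketch
    return super_features, chunks

sfs, chunks = cache_capacity(12 * MB)
print(sfs, chunks)  # 1048576 349525
```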

With a cache of 4 MB, the hit rate is between 50%
and 90% of the maximum, and the hit rate grows until
around 12 or 16 MB, when it is quite close to the final
value we show at 20 MB. Email showed the worst hit
rate, maxing at around 80%, which is still a reasonably
high result. Email has worse deduplication locality than
the other datasets and this impacts delta compression in
a data-dependent manner. Regardless of the dataset size
(5 TB up to 13 TB) and deduplication (5-37X), all of
the datasets reached their maximum hit rates with a
similarly sized cache. Our implementation has a minimum
cache size related to the large batches of chunks
transferred during replication as well as the multiple stages of
pipelined replication that either add data to the cache or
need to check for matches in the cache.

Although it may be reasonable to use a larger cache 
in enterprise-sized servers, note that our experiments are 
for single datasets at a time. A storage server would 
normally handle numerous simultaneous streams, each 
needing a portion of the cache, so our single-stream 
results should be scaled accordingly. Since the lo- 
cality of delta compression for backup datasets corre- 
sponds closely to identity locality, only a small cache 
is needed, and our memory requirements should scale 
well with the number of backup streams. Our intuition 
is that users/applications often make small modifications 
to files, so duplicate chunks indicate a region of the pre- 
vious version of a file that is likely to provide delta com- 
pression. 


6.2 Delta Encoding 


Table 2: Datasets, percent of post-deduplication bytes
delta encoded, delta encoding factor, and resulting delta
factor for each dataset, which corresponds to Table 1
after seeding.

Our similarity detection technique is able to find matches
for most chunks during replication and achieves high
encoding compression on those chunks. The second column
of Table 2 shows the percentage of bytes after deduplication
that are delta encoded after seeding. 55-82%
of bytes undergo delta encoding with a median of 77%.
Delta encoding factors vary from 8.91-30.11X with a
median of 15.65X. As an example of how the delta factor
is calculated for System Logs, 77% of bytes after
deduplication are delta encoded to $\frac{1}{15.63}$ of their
original size, and 23% of bytes are not encoded. Therefore,
$\frac{1}{\frac{0.77}{15.63} + 0.23} = 3.55$ (rounding in the tables affects accuracy),
which is equivalent to dividing post-deduplication bytes
by post-delta compression bytes.
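
The relationship between the fraction of bytes encoded, the encoding factor, and the resulting delta factor can be written as a small function. This is a sketch with a hypothetical function name; the example plugs in the median values quoted above (77% and 15.65X) purely for illustration, and, as the text notes, rounded inputs do not reproduce the tabulated factor exactly.

```python
def delta_factor(fraction_encoded, encoding_factor):
    """Post-deduplication bytes divided by post-delta bytes.

    fraction_encoded: share of post-deduplication bytes that are delta encoded.
    encoding_factor: how much those bytes shrink (e.g. 15.65 means 1/15.65).
    """
    remaining = fraction_encoded / encoding_factor + (1.0 - fraction_encoded)
    return 1.0 / remaining

example = delta_factor(0.77, 15.65)
print(round(example, 2))  # 3.58
```

The bytes that are not encoded dominate the result: even an unbounded encoding factor on 77% of the bytes could not push the delta factor past 1/0.23, about 4.35. This is why increasing the fraction of chunks that are delta encoded matters more than improving the encoder.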


While further improvements in encoding compression 
are likely possible, we are already shrinking delta en- 
coded chunks to a small fraction of their original size. 
On the other hand, increasing the fraction of chunks that 
receive delta encoding could lead to larger savings. 


6.3 Multi- vs 1-Level Delta


While we have described the delta compression algo- 
rithm as representing a chunk as a 1-level delta from a 
base, because we decode chunks at the remote repository, 
our delta replication is actually multi-level. Specifically, 
consider a delta encoded chunk B transferred across the 
network that is then decoded using base chunk C and 
stored. At a later time, another delta encoded chunk A 
is transferred across the network that uses B as a base. 
Although B exists in a decoded form, it was previously a 
1-level delta encoded chunk, so A is effectively a 2-level 
delta because A referenced B, which referenced C. Our 
replication system, like many, does not bound the delta 
level, since chunks are decoded at the destination, and we 
effectively achieve multi-level delta across the network. 


As compared to replicating delta compressed chunks, 
storing such chunks introduces extra complexity. Al- 
though n-level delta is possible for any value of n, de- 
coding an n-level delta entails n reads of the appropriate 
base chunks, which can be inefficient in a storage system.
For this reason, a delta storage system [1] may only
support 1- or 2-level delta encodings to bound decode
times.

Figure 6: Multi-level delta compression improves 6-30%
beyond 1-level delta.


To compare the benefits of multi- and 1-level delta,
we studied the compression differences. We modified 
our replication system so that after a chunk is delta en- 
coded, its sketch is then invalidated. This ensures that 
delta encoded chunks will never be selected as the base 
for encoding other chunks, preventing 2-level or higher 
deltas. 

In Figure 6, multi- and 1-level delta are compared,
with multi-level delta adding 1.03 - 1.18X additional
compression. As an example, Source Code increased
from 178X to 194X total compression (deduplication,
delta, and GZ), which is roughly similar to adding a second
super-feature as discussed in Section 6.4. These results
also highlight that 1-level delta is a reasonable
approximation to multi-level, when multi-level is impractical.
Unlike a storage system, we are able to get the compression
benefits of multi-level without the slowdowns
related to decoding n-level delta chunks.


6.4 Sketch Index vs Stream-Informed Sketch Cache 


We next investigate how our stream-informed caching 
technique compares to the alternative of a full sketch in- 
dex. We expect that using a full sketch index could find 
potential matches that a sketch cache will miss because 
of imperfect locality, but maintaining indexes for billions 
of stored chunks adds significant complexity. We explore 
the compression trade-offs by comparing delta replica- 
tion with a cache against a simulator with complete in- 
dexes for each super-feature. 

Figure 7 compares compression results for the index 
and cache options. The lowest region of each vertical bar 
is the amount of compression achieved by deduplication, 
and because of differences in implementation between 
our product and simulator, these numbers vary slightly. 
The next four sets of colored regions show how much ex- 
tra compression is achieved by using 1-4 super-features. 
The cache experiments ran on production hardware, and 
the cache was fixed at 12 MB. Also, our simulator with
an index did not apply local compression, so only
deduplication and delta compression are analyzed.

Figure 7: Using a stream-informed sketch cache results
in nearly as much compression as using a full index,
and using two super-features with a cache achieves more
compression than a single super-feature index.


In all cases, using a single super-feature adds significant
compression beyond deduplication alone, with
decreasing benefit as the number of super-features
increases. Although using a sketch cache generally has
lower delta compression than an index, the results are
reasonably close (Workstations with 1 super-feature
and a cache is within 14% of the index with 1 super-feature).
Importantly, we can use more than one super-feature
in our cache with little additional overhead compared
to multiple on-disk indexes for super-features. Using
a cache with two or more super-features achieves
greater compression than a single index, which is why
we decided to pursue the caching technique.


An interesting anomaly is that Source Code achieved
higher delta compression with a stream-informed sketch
cache than a full index, even though we would expect
a limited-size cache to be an approximation to a
full index. We found that Source Code and Home
Directories had extremely high numbers of potential
similarity matches (> 10,000), all with the same number
of super-feature matches, which was likely due to
repeated headers in source files¹. Selecting among the
candidates leads to differences in delta compression, and
the selection made by a stream-informed cache leads
to higher compression for Source Code than our
tie-breaking technique for the index (most recently written).
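
The candidate-selection issue described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the `Candidate` record, its field names, and the `pick_base` helper are hypothetical, and only the "most recently written" tie-break for the index comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chunk_id: str    # hypothetical identifier of a stored chunk
    sf_matches: int  # number of super-features shared with the new chunk
    write_seq: int   # hypothetical write order; larger means more recent


def pick_base(candidates):
    """Prefer the most super-feature matches, then the most recent write.

    The second key mirrors the index tie-break named in the text
    (most recently written); everything else is an assumption.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c.sf_matches, c.write_seq))


# Three candidates tied on super-feature matches, as in the Source Code case.
tied = [
    Candidate("a", sf_matches=1, write_seq=10),
    Candidate("b", sf_matches=1, write_seq=42),
    Candidate("c", sf_matches=1, write_seq=7),
]
best = pick_base(tied)
```

When thousands of candidates tie, the choice rests entirely on the secondary key, which is consistent with the observation that a different selection rule can yield different delta compression.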


¹This slowed throughput for Home Directories, and
those experiments would not have completed without adjusting
the sketch index. We modified the sketch index for all Home
Directories results such that if a sketch has more than 128 similarity
matches, the current sketch is not added to the index.
Home Directories had similar compression with either
a cache or index.

Another unexpected result is that increasing the number
of super-features used with our cache did not always
increase total compression. Since we fix the size of our
cache at 12 MB, when the number of super-features
increases, fewer chunks are represented in the cache. The
optimal cache size tends to increase with the number of
super-features, but the index results indicate that adding
super-features has diminishing benefit.


6.5 Partial-index of Fingerprints and Sketches 


As a comparison to previous work, we implemented a 
partial-index of fingerprints and sketches that updates 
ages when either a chunk’s fingerprint or sketch is ref- 
erenced and evicts from the partial-index with an LRU 
policy. While it is somewhat unfair to compare a partial- 
index to our technique, it is useful for analyzing the scal- 
ability of such systems. 
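
The partial-index behavior described above (ages refreshed on either a fingerprint or a sketch reference, LRU eviction) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the class name, the method names, and the simplification of one chunk per sketch are all assumptions.

```python
from collections import OrderedDict


class PartialIndex:
    """LRU-evicted partial index of fingerprints and sketches (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.by_fp = OrderedDict()  # fingerprint -> sketch, oldest first
        self.by_sketch = {}         # sketch -> fingerprint (one chunk per sketch)

    def insert(self, fp, sketch):
        if fp in self.by_fp:
            self.by_fp.move_to_end(fp)  # already present: just refresh its age
            return
        self.by_fp[fp] = sketch
        self.by_sketch[sketch] = fp
        if len(self.by_fp) > self.capacity:
            old_fp, old_sketch = self.by_fp.popitem(last=False)  # evict LRU entry
            if self.by_sketch.get(old_sketch) == old_fp:
                del self.by_sketch[old_sketch]

    def lookup_fp(self, fp):
        """Deduplication lookup; a hit refreshes the entry's age."""
        if fp in self.by_fp:
            self.by_fp.move_to_end(fp)
            return True
        return False

    def lookup_sketch(self, sketch):
        """Similarity lookup; a hit also refreshes the entry's age."""
        fp = self.by_sketch.get(sketch)
        if fp is not None:
            self.by_fp.move_to_end(fp)
        return fp


idx = PartialIndex(capacity=2)
idx.insert("f1", "s1")
idx.insert("f2", "s2")
hit = idx.lookup_sketch("s1")  # refreshes f1, so f2 becomes the oldest entry
idx.insert("f3", "s3")         # exceeds capacity and evicts f2
```

Because the index holds a bounded number of entries, matches against data older than the index can hold are lost, which is the scalability limit the experiment below examines.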

To focus on the data patterns of typical backups, we 
limit this experiment to two full weeks of each dataset, 
which typically consists of a full backup followed by 
six incremental backups followed by another full and six 
incremental backups. For Email, we selected two full 
backups a week apart, since a full backup was created 
each day. 

Figure 1 (presented in Section 2) shows the amount
of compression achieved (deduplication and delta) as the 
partial-index size increases along the x-axis, which is 
measured as the fraction of the first week’s data kept in 
a partial-index. When the partial-index is able to hold 
more than a week’s worth of data (1.0 on x-axis), com- 
pression jumps dramatically as the second week’s data 
compresses against the first week’s data. To highlight 
this property, the horizontal axis is normalized based 
on the first week’s deduplication rate, since the post- 
deduplication size affects how many fingerprints and 
sketches must be maintained. 

These results highlight that techniques using a partial- 
index must hold a full backup cycle’s worth of data (e.g. 
at least one full backup) to achieve significant compres- 
sion, while our delta compression technique uses a com- 
bination of a deduplication index and stream-informed 
sketch cache to achieve high compression with small 
memory overheads. For storage systems with large back- 
ups or backups from numerous sources, our algorithm 
would tend to scale memory requirements better, since 
Figure 5 demonstrates that we only need a fixed-size 
cache regardless of the dataset size. 


6.6 Interaction of Delta and Local Compression 


Our replication system includes local compressors such 
as GZ that can be selected by the administrator. During 
replication, chunks are first deduplicated and many of the 
Name            No Delta    With Delta        Delta
                GZ          Delta    GZ       Improve.
Source Code
Workstations
Email
System Logs
Home Dirs
Median          312         355      286      2.08


Table 3: Delta encoding overlaps with the effectiveness 
of GZ, but total compression including delta is still a 2X 
improvement beyond alternative approaches. Results are 
after initial seeding. 


remaining chunks are delta compressed. All remaining 
data bytes (delta compressed or not) are then compressed 
with a local compressor. A subtle detail of delta com- 
pression is that it reduces redundancies within a chunk 
that appear in the previous base chunk and within itself, 
which overlaps with compression that local compressors 
might find. 


We evaluated the impact of delta compression on GZ
and total compression by rerunning our replication
experiments with GZ enabled and delta compression either
enabled or disabled. Table 3 shows GZ compression
achieved both with and without delta after seeding.
Results with delta enabled are the same as Table 1.
Deduplication factors are the same with or without delta
enabled, and are removed from the table for space reasons.
GZ and delta overlap by 5-50% (7.20X vs 3.99X for GZ
on Source Code), but using delta in combination with
GZ still provides improved total compression (2.08X for
Source Code). The overlap of local compression and
delta compression varies with dataset and type of local
compressor selected (GZ, LZ, etc.), but we typically see
significant advantages to using both techniques in
combination with deduplication.
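
As a worked example of the overlap figure, the Source Code numbers quoted above can be turned into a percentage. This is a sketch under one assumption not spelled out in the text: that 7.20X is the GZ factor without delta, 3.99X is the GZ factor with delta, and "overlap" means the relative drop in the GZ factor. The function name is hypothetical.

```python
def gz_overlap(gz_without_delta, gz_with_delta):
    """Fraction of GZ's compression factor lost once delta runs first."""
    return 1.0 - gz_with_delta / gz_without_delta


# Source Code values quoted in the text.
overlap_source_code = gz_overlap(7.20, 3.99)
print(f"Source Code overlap: {overlap_source_code:.1%}")
```

Under this reading, Source Code sits near the upper end of the reported 5-50% range.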


6.7 WAN Replication Improvement 


We performed numerous replication experiments mea- 
suring network and effective throughput. Figure 8 shows 
a representative replication result for the Workstations 
dataset. Throughput was throttled at T3 speed (44 Mb/s) 
and measured every 10 minutes. We found effective 
throughput is 1-2 orders of magnitude faster than net- 
work throughput, which corresponds to total compres- 
sion. Although throughput could be further improved 
with better pipelining and buffering, this result highlights 
that compression boosts effective throughput and reduces 
the time until transfer is complete.
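
The relationship between network and effective throughput can be sketched as a simple product. The compression factors of 10 and 100 below are illustrative stand-ins for "1-2 orders of magnitude" and are not measured values; the function name is hypothetical, and the sketch assumes the link runs at the throttle speed.

```python
def effective_throughput_mbps(network_mbps, total_compression):
    """Logical data rate when each transferred bit stands for several bits."""
    return network_mbps * total_compression


T3_MBPS = 44.0  # throttle speed stated in the text

low = effective_throughput_mbps(T3_MBPS, 10)    # one order of magnitude
high = effective_throughput_mbps(T3_MBPS, 100)  # two orders of magnitude
print(low, high)
```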
Figure 8: Effective throughput is higher than network
throughput due to compression during replication.


7 Performance Characteristics 


In this section, we discuss overheads of delta compression
and limitations of stream-informed delta compression.


7.1 Delta Overheads 


First, capacity overheads for storing sketches are relatively
small. Each chunk stored in a container (after
deduplication) also has a sketch of less than 20 bytes
added to the metadata section of the container, but
our stream-informed approach removes the need for a
full on-disk index of sketches.


There are also two performance overheads added to 
the system: sketching on the write path and reading sim- 
ilar base chunks to perform delta compression. First, 
incoming data is sketched before being written to disk, 
which introduces a 20% slowdown in unoptimized tests. 
The sketching stage happens after deduplication, so after 
the first full backup, later backups experience less slow- 
down since a large fraction of the data is duplicate and 
does not need to be sketched. As CPU cores increase 
and pipelining is further optimized, this overhead may 
become negligible. 


The second, and more sizable throughput overhead, 
is during replication when similar chunks are read from 
disk to serve as the base for delta compression, which 
limits our throughput by the read speed of our storage 
system. Our read performance varies with the number 
of disk spindles and data locality, which we are continu- 
ing to investigate. Remote sites also tend to have lower- 
end hardware with fewer disk spindles than data ware- 
houses. For these reasons, we recommend turning on 
delta compression for low bandwidth connections (6.3 
Mb/s or slower), where delta compression is not the bot- 
tleneck and extra delta compression multiplies the effec- 
tive throughput. Also, it should be noted that read over- 
heads only take place when delta compression occurs, so 


[Figure 9 plot: y-axis "% Resources".]
Figure 9: CPU and disk utilization grows fairly linearly 
on the remote repository as the number of replication 
streams increases. Error bars indicate a standard devi- 
ation. 


if no similarity matches are found, read overhead will be 
minimal. 


Effectively, we are trading computation and I/O re- 
sources for higher network throughput, and we expect 
computation and I/O to improve at a faster rate than net- 
work speeds increase, especially in remote areas. We 
expect this tradeoff to become more important in the fu- 
ture as data sizes continue to grow. Improvements to our 
technique and hardware may also expand the applicabil- 
ity of delta replication to a larger range of customers. 


Delta compression increases computational and I/O 
demands on both the backup server and remote reposi- 
tory. We set up an experiment replicating from twelve 
small backup servers (2 cores and 3-disk RAID) to a 
medium-sized remote repository (8 cores and 14-disk 
RAID) with a T1 connection (1.5 Mb/s). At the backup 
servers, the CPU and disk I/O overheads were modest 
(2% and 4% respectively). At the remote repository, 
CPU and disk overhead scaled linearly as the number of
replication streams grew from 1 to 12 as shown in
Figure 9. Measurements were made over every 30 second
period after the seeding phase, and standard deviation
error bars are shown. These results suggest that dozens of
backup servers could be aggregated to one medium-sized
remote repository. As future work, we would like to
increase the scaling tests.


7.2 Stream-Informed Cache Limitations 


Since we do not have a full sketch index, loss of cache 
locality translates to a loss in potential compression. 
While earlier experiments showed that stream-informed 
caching is effective, those experiments were on individ- 
ual datasets. In a realistic environment, multiple datasets 
have to share a cache, and garbage collection further de- 
grades locality on disk because live chunks from differ- 
ent containers and datasets can be merged into new con- 
tainers. 
We ran an experiment with a midsize storage appliance
with a 288 MB cache sized to handle approximately
20 replicating datasets. The experiment consisted of 
replicating a real dataset to this appliance while vary- 
ing the number of synthetic datasets also replicated be- 
tween 0, 24, and 49. This test was performed with three 
real datasets. The synthetic datasets were generated with 
an internal tool that had deduplication of 12X and delta 
compression of 1.7X, which exercises our caching in- 
frastructure in a realistic manner. When the number of 
datasets was increased to 25 (1 real and 24 synthetic), 
delta compression decreased 0%, 6% and 12% among 
the three real datasets relative to a baseline of replicat- 
ing each real dataset individually. Increasing to 49 syn- 
thetic datasets (beyond what is advised for this hardware) 
caused delta compression to decrease 0%, 12%, and 27% 
from the baseline for the three real datasets. Our intuition 
is that the variability in results is due to locality differ- 
ences among these datasets. In general, these results sug- 
gest our caching technique degrades in a gradual manner 
as the number of replicating datasets increases relative to 
the cache size. 
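
To give a feel for the cache pressure in this experiment, the sketch below divides the 288 MB cache evenly among the replicating datasets. Even sharing is an assumption made only for illustration; the text does not say how the cache is apportioned, and the function name is hypothetical.

```python
CACHE_MB = 288.0  # cache size stated in the text


def cache_share_mb(num_datasets, cache_mb=CACHE_MB):
    """Average cache per dataset, assuming an even split (an assumption)."""
    if num_datasets < 1:
        raise ValueError("need at least one dataset")
    return cache_mb / num_datasets


# 1 real dataset alone, the sizing target of 20, then 25 and 50 in total.
shares = {n: cache_share_mb(n) for n in (1, 20, 25, 50)}
for n, mb in shares.items():
    print(f"{n:2d} datasets: {mb:.2f} MB each")
```

Under this simplification, the share at the sizing target of 20 datasets is in the same range as the 12 MB cache used in the single-dataset experiments, and it roughly halves when 49 synthetic datasets are added.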


This experiment investigates how multiple datasets 
sharing a cache affect delta compression, and we vali- 
date these findings with results from the field presented 
in Section 8, where customers achieved 2X additional 
delta compression beyond deduplication even though 
their systems had multiple datasets sharing a storage ap- 
pliance. While we do not know the upper bound on 
how much delta compression these customers could have 
achieved in a single-dataset scenario, these results sug- 
gest sizable network savings. 


8 Results from Customers


Basic replication has been available with EMC Data Do- 
main systems for many years using the deduplication 
protocol of Figure 3, and the extra delta compression 
stage became available in 2009. The version available to 
customers has a cache scaled to the number of supported 
replication streams. 


We analyzed daily reports from several hundred stor- 
age systems used by our customers during the second 
week of August 2011, including a variety of hardware 
configurations. Reporting median values, a typical cus- 
tomer transferred 1 TB of data across a 3.1 Mb/s link dur- 
ing the week, though because of our compression tech- 
niques, much less data was physically transferred across 
the network. Median total compression was 32X includ- 
ing deduplication, delta, and local compression. Fig- 
ure 10 shows the distribution of delta compression with 
50% of customers achieving over 2X additional com- 
pression beyond what deduplication alone achieves, and 
outliers achieving 5X additional delta compression. Con- 
Figure 10: Distribution of delta compression. 50% of
customers achieve over 2X additional delta compression.

Figure 11: Distribution of hours saved by customers. We
estimate that 50% of customers save over 588 hours of
replication time per week because of our combination of
compression techniques.


current work [38] provides further analysis of replication 
and backup storage in general. 

Finally, in Figure 11, we show how much time was
saved by our customers versus sending data without any
compression. Our reports indicate how much data was
transferred, an estimate of network throughput (though
periodic throttling is difficult to extract), and
compression, so we can calculate how long replication would
take without compression. The median customer would
need 608 hours to fully replicate their data (more hours
than are in a week), but with our combined compression,
replication was reduced to 20 hours (saving 588 hours
of network transfer time). For such customers, it would
be impossible to replicate their data each week
without compression, so delta replication significantly
increases the amount of data that can be protected.
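
The time-saved calculation can be sketched as follows. The function name and the decimal interpretation of 1 TB are assumptions. Note that the medians reported in this section (1 TB, 3.1 Mb/s, 32X, 608 hours, 20 hours) are each taken separately across customers, so plugging the median inputs into the formula is not expected to reproduce the median outputs exactly.

```python
def transfer_hours(data_bytes, link_bits_per_s, total_compression=1.0):
    """Hours needed to send data_bytes over a link after compression."""
    bits_on_wire = data_bytes * 8 / total_compression
    return bits_on_wire / link_bits_per_s / 3600


# Median inputs from the text: 1 TB (taken as 1e12 bytes), 3.1 Mb/s, 32X.
uncompressed = transfer_hours(1e12, 3.1e6)
compressed = transfer_hours(1e12, 3.1e6, total_compression=32)
print(f"{uncompressed:.0f} h without compression, {compressed:.1f} h with")

# Hours saved for the median customer, using the figures reported in the text.
reported_saved = 608 - 20
```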


9 Related Work 


Our stream-informed delta replication project builds 
upon previous work in the areas of optimizing network 
transfer, delta compression, similarity detection,
deduplication, and caching techniques.

Minimizing network transfer has been an area of on- 
going research. One of the earliest projects by Spring 
et al. [33] removed duplicate regions in packets with a 
synchronized cache by expanding from duplicate start- 
ing points. LBFS [25] divided a client’s file into chunks 
and deduplicated chunks against any previously stored. 
Jumbo Store [14] used a hierarchical representation of 
files that allowed them to quickly check whether large 
subregions of files were unchanged. CZIP [26] applied a 
similar technique with user level caches to remove dupli- 
cate chunks while synchronizing remote repositories. 

Most work in file synchronization has assumed that 
versions are well identified so that compression can 
be achieved relative to one (or a few) specified file(s). 
Rsync [37] is a widely used tool for synchronizing fold- 
ers of files based on compressing against files with the 
same pathname. An improvement [35] recursively split 
files to find large duplicate regions using a memory 
cache. 

Beyond finding duplicates during network transfer, 
delta compression is a well known technique for com- 
puting the difference between two files or data ob- 
jects [17, 36]. Delta compression was applied to web 
pages [8, 24] and file transfer and storage [7, 9, 21, 34] 
using a URL and file name, respectively, to identify a 
previous version. 

When versioning information is unavailable, a mecha- 
nism is needed to find a previous, similar file or data ob- 
ject to use as the base for delta compression. Broder [4, 
5] performed some of the early work in the resem- 
blance field by creating features (such as Rabin finger- 
prints [31]) to represent data such that similar data tend 
to have identical features. Features were further grouped 
into super-features to improve matching efficiency by 
reducing the number of indexes. Features and super- 
features were used to select an appropriate base file for 
deduplication and delta compression [12, 19], removing 
the earlier requirement for versioning information. TA- 
PER [18] presented an alternative to super-features by 
representing files with a Bloom filter storing chunk fin- 
gerprints and measuring file similarity based on the num- 
ber of matching bits between Bloom filters and then delta 
compressing similar files. Delta compression within the 
storage system has used super-feature techniques to iden- 
tify similar files or regions of files [1, 40]. Aronovich et 
al. [1] used 16 MB chunks to decrease sketch indexing 
requirements and had hundreds of disk spindles for per- 
formance. 

Storage systems have eliminated duplicate regions
based on querying an index of fingerprints [3, 22, 29, 30].
Noting that the fingerprint index becomes much larger
than will fit in memory and that disk accesses can
become the bottleneck, Zhu et al. [41] presented a
technique to take advantage of stream locality to reduce
disk accesses by 99%. Several variants of this
approach explored alternative indexing strategies to load
a fingerprint cache such as moving the index to flash
memory [10] and indexing a subset of fingerprints
either based on logical or post-deduplication layout on
disk [2, 16, 20, 23, 39]. Our similarity detection
approach builds upon these caching ideas to load sketches
as well as fingerprints into a stream-informed cache.


10 Conclusion and Future Work 


In this paper, we present stream-informed delta com- 
pression for replication of backup datasets across a 
WAN. Our approach leverages deduplication locality to 
also find similarity matches used for delta compression. 
While locality properties of duplicate data have been pre- 
viously studied, we present the first evidence that similar 
data has the same locality. We show that using a compact 
stream-informed cache to load sketches achieves almost 
as much delta compression as using a full index without 
extra data structures. Our technique has been incorpo- 
rated into the Data Domain systems, and average cus- 
tomers achieve 2X additional compression beyond dedu- 
plication and save hundreds of hours of replication time 
each week. 

In future work, we would like to expand the number 
of WAN environments that benefit from delta replica- 
tion by improving the read throughput, which currently 
gates our system. Also, we would like to further ex- 
plore delta compression techniques to improve compres- 
sion and scalability. 
Abstract


Power consumption has become an important factor 
in modern storage system design. Power efficiency is 
particularly beneficial in disk-based backup systems that 
store mostly cold data, have significant idle periods, and 
must compete with the operational costs of tape-based 
backup. There are no prior published studies on power 
consumption in these systems, leaving researchers and 
practitioners to rely on existing assumptions. In this pa- 
per we present the first analysis of power consumption 
in real-world, enterprise, disk-based backup storage sys- 
tems. We uncovered several important observations, in- 
cluding some that challenge conventional wisdom. We 
discuss their impact on future power-efficient designs. 


1 Introduction 


Power has become an important design consideration 
for modern storage systems as data centers now account 
for close to 1.5% of the world’s total energy consump- 
tion [14], with studies showing that up to 40% of that 
power comes from storage [25]. Power consumption is 
particularly important for disk-based backup systems be- 
cause: (1) they contain large amounts of data, often stor- 
ing several copies of data in higher storage tiers; (2) most 
of the data is cold, as backups are generally only accessed 
when there is a failure in a higher storage tier; (3) backup 
workloads are periodic, often leaving long idle periods 
that lend themselves to low power modes [31,35]; and 
(4) they must compete with the operational costs of low 
power, tape-based backup systems. 

Even though there has been a significant amount 
of work to improve power consumption in backup or 
archival storage systems [8,21,27], as well as in primary 
storage systems [3, 33, 36], there are no previously pub- 
lished studies of how these systems consume power in 
the real world. As a result, power management in backup 
storage systems is often based on assumptions and com- 
monly held beliefs that may not hold true in practice. For 
example, prior power calculations have assumed that the 
only power needed for a drive is quoted in the vendor’s 
specification sheet [8, 27,34]. However, an infrastruc- 
ture, including HBAs, enclosures, and fans, is required to 
support these drives; these draw a non-trivial amount of 
power, which grows proportionally with the number of 
drives in the system. 

In this paper, we present the first study of power 
consumption in real-world, large-scale, enterprise, disk- 
based backup storage systems. We measured systems 


Andrew W. Leung†  Erez Zadok‡

†Backup Recovery Systems Division, EMC Corporation


representing several different generations of production 
hardware using various backup workloads and power 
management techniques. Some of our key observa- 
tions include considerable power consumption variations 
across seemingly similar platforms, disk enclosures that 
require more power than the drives they house, and the 
need for many disks to be in a low-power mode before 
significant power can be saved. We discuss the impact of 
our observations and hope they can aid both the storage 
industry and research communities in future development 
of power management technologies. 


2 Related Work 


Empirical power consumption studies have guided the 
design of many systems outside of storage. Mobile 
phones and laptop power designs, which are both sensi- 
tive to battery lifetime, were influenced by several stud- 
ies [7,17,22,24]. In data centers, studies have focused 
on measuring CPU [18,23], OS [5,6, 11], and infrastruc- 
ture power consumption [4] to give an overview of where 
power is going and the impact various techniques have, 
such as dynamic voltage and frequency scaling (DVFS). 
Recently, Sehgal et al. [26] measured how various file 
system configurations impact power consumption. 


Existing storage system power management has 
largely focused on managing disk power consumption. 
Much of this existing work assumes that as storage 
systems scale their capacity—particularly backup and 
archival systems—the number of disks will increase to 
the point where disks are the dominant power con- 
sumers. As a result, most solutions try to keep as 
many drives powered-off as possible, spun-down, or spun 
at a lower RPM. For example, archival systems like 
MAID [8] and Pergamum [27] use data placement, scrub- 
bing, and recovery techniques that enable many of the 
drives in the system to be in a low-power mode. Sim- 
ilarly, PARAID [33] allows transitioning between sev- 
eral different RAID layouts to trade-off energy, perfor- 
mance, and reliability. Hibernator [36] allows drives in a 
RAID array to operate at various RPMs, reducing power 
consumption while limiting the impact to performance. 
Write Off-Loading [19] redirects writes from low-power 
disks to available storage elsewhere, allowing disks to 
stay in a low-power mode longer. 


Our goal is to provide power consumption measure- 
ments from real-world, enterprise-scale backup systems, 
to help guide designs of power-managed storage systems. 
3 Methodology


We measured several real-world, enterprise-class backup 
storage systems. Each used a Network-Attached-Storage 
(NAS) architecture with a storage controller connected to 
multiple, external disk drive enclosures. Figure 1 shows
the basic system architecture. Each storage controller
exports file-based interfaces to clients, such as NFS and
CIFS, and backup-based interfaces, such as VTL and
those of backup software (e.g., Symantec’s OST [20]).
Each storage controller performs inline data deduplica- 
tion; typically these systems contain more CPUs and 
memory than other storage systems to perform chunking 
and to maintain a chunk index. 
Figure 1: Backup system architecture

Table 1: Controller hardware summary





Table 1 details the four different EMC controllers that 
we measured. Each controller was shipped or will be 
shipped in a different year and represents hardware up- 
grades over time. Each controller, except for DD670, 
stores all backup data on disks in external enclosures, 
and the four disks (three active plus a spare) in the con- 
troller store only system and configuration data. DD670 
is a low-end, low-cost system that stores both user and 
system data in its seven disks (six active plus a spare). 
DDTBD is planned for a future release and does not yet 
have a model number. Each controller ran the same soft- 
ware version of the DDOS operating system. 


Table 2 shows the two different enclosures that we 
measured. Each enclosure can support various capacity 
SATA drives. Based on vendor specifications, the drives 
we used have power usage of about 6-8W idle, 8-12W
active, and less than 1W when spun-down. Controllers 
communicate with the enclosures via Serial Attached 
SCSI (SAS). Large system configurations can support 
more than fifty enclosures attached to a single controller, 
which can host more than a petabyte of physical capacity 
and tens of petabytes of logical, deduplicated capacity. 
Table 2: Enclosure hardware summary





Experimental setup. We measured controller power
consumption using a Fluke 345 Power Quality Clamp
Meter [10], an in-line meter that measures the power
draw of a device. The meter provides readings with an
error of ±2.5%. We measured enclosure power consumption
using a WattsUP Pro ES [32], another in-line meter,
with an accuracy of ±1.5% for measured value plus
a constant error of ±0.3 watt-hours. All measurements
were done within a data center environment with room
temperature held between 70 °F and 72 °F.
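
The stated meter accuracies translate into bounds on any single reading, as the sketch below shows. It assumes the errors are symmetric and that the constant term is applied in the same unit as the reading; the function name is hypothetical, and the 200 W enclosure reading is an invented example rather than a measurement from this study. The 937 W controller reading is the DDTBD value reported later in the text.

```python
def reading_bounds(reading, rel_err, abs_err=0.0):
    """Return (low, high) bounds for a reading with relative plus constant error."""
    margin = reading * rel_err + abs_err
    return reading - margin, reading + margin


# Controller meter: 2.5% relative error, applied to the 937 W DDTBD reading.
ctrl_low, ctrl_high = reading_bounds(937.0, 0.025)

# Enclosure meter: 1.5% relative error plus a 0.3 constant term,
# applied to a hypothetical 200 W reading.
encl_low, encl_high = reading_bounds(200.0, 0.015, abs_err=0.3)
print(ctrl_low, ctrl_high, encl_low, encl_high)
```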

We connected the controllers and enclosures to the
meters separately, to measure their power. Thus we present
each component’s measurements separately, rather than as an
entire system (e.g., a controller attached to several
enclosures). The meters we used allowed us to measure only
entire device power consumption, not individual com- 
ponents (e.g., each CPU or HBA) or data-center factors 
(e.g., cooling or network infrastructure). We present all 
measurements in watts and all results are an average of 
several readings with standard deviations less than 5%. 


Benchmarks. For each controller and enclosure, we 
measured the power consumption when idle and when 
under several backup workloads. Each workload is a 
standard, reproducible workload used internally to test 
system performance and functionality. The workloads 
consist of two clients connecting over a 10 GigE network 
to a controller writing 36 backup streams. Each backup 
stream is periodic in nature, where a full backup image is 
copied to the controller, followed by several incremental 
backups, followed by another full backup, and so on. For 
each workload we ran 42 full backup generations. The 
workloads are designed to mimic those seen in the field 
for various backup protocols. 


WL-A    WL-B    WL-C


Table 3: Backup workloads used 





We used the three backup protocols shown in Ta- 
ble 3. Clients send backup streams over NFS in WL-A, 
and over Symantec’s OST in WL-B. In both cases, all 
deduplication is performed on the server. WL-C uses
BOOST [9], an EMC backup client that performs stream 
chunking on the client side and sends only unique chunks 
to the server, reducing network and server load. To mea- 
sure the power consumption of a fully utilized disk sub- 
system, we used an internal tool that saturates each disk. 
4 Discussion


We present our analysis for a variety of configurations 
in three parts: isolated controller measurements, isolated 
enclosure measurements, and whole-system analysis us- 
ing controller and enclosure measurements. 


4.1 Controller Measurements 


We measured storage controller power consumption un- 
der three different scenarios: idle, loaded, and power 
managed using processor-specific power-saving states. 


Controller idle power. A storage controller is consid- 
ered idle when it is fully powered on, but is not handling 
a backup or restore workload. In our experiments, each 
controller was running a full, freshly installed, DDOS 
software stack, which included several small background 
daemon processes. However, as no user data was placed 
on the systems, background jobs such as garbage collec- 
tion, were not run. Idle power consumption indicates the 
minimum amount of power a non-power-managed con- 
troller would consume when sitting in the data center. 

It is commonly assumed that disks are the main con- 
tributor to power in a storage system. As shown in Ta- 
ble 4, the controllers can also consume a large amount of 
power. In the case of DDTBD, the power consumption 
is almost equal to that of 100 2TB drives [13]. This is 
significant because even a controller with no usable disk 
storage can consume a lot of power. Yet, the performance 
of the controller is critical to maintain high deduplication 
ratios, and necessary to support petabytes of storage— 
requiring multiple fast CPUs and lots of RAM. These 
high idle power-consumption levels are well known [15]. 
Although computer component vendors have been reduc- 
ing power consumption in newer systems, there is a long 
way to go to support true power proportionality in com- 
puting systems; therefore, current idle controller power 
levels must be factored into future designs. 





Table 4 shows a large difference in power consumption 
between controllers. DDTBD consumes almost 3.5 x 
more power than DD670. Here, difference is largely due 
to the different hardware profiles. DDTBD is a more 
powerful, high-end controller with significantly more 
CPU and memory, whereas DD670 is a low-end model. 
However, this is not the case for the power differences be- 
tween DD880 and DD860. DD880 consumes more than 
twice the power as DD860, yet Table 1 shows that their 
hardware profiles are fairly similar. The amount of CPU 
and memory plays a major role in power consumption; 
however, other factors such as the power efficiency of individual
components also contribute. Unfortunately, our
measurement methodology prevented us from identifying
the internal components that contribute to this difference.
However, part of this difference can be attributed to
DD860 being a newer model with hardware components
that consume less power than older models.


Table 4: Idle power consumption for storage controllers


To better compare controller power consumption, we 
normalized the power consumption numbers in Table 4 
to the maximum usable physical storage capacity. The 
maximum capacities for the DD880, DD670, DD860, 
and DDTBD are 192TB, 76TB, 192TB, and 1152TB, 
respectively. This gives normalized power consumption 
values of 2.89W/TB for DD880, 2.96W/TB for DD670, 
1.35W/TB for DD860, and 0.675W/TB for DDTBD. Al- 
though the normalized values are roughly the same for 
DD880 and DD670, the watts consumed per raw byte 
trends down with newer generation platforms. 
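
The normalization above is simple arithmetic. The following Python sketch runs it in reverse to recover the idle power implied by each quoted W/TB figure. The derived wattages are estimates computed from the rounded values in the text, not the measured values of Table 4, and all function and variable names are illustrative.

```python
# Capacities (TB) and normalized idle power (W/TB) as quoted in the text.
CAPACITY_TB = {"DD880": 192, "DD670": 76, "DD860": 192, "DDTBD": 1152}
WATTS_PER_TB = {"DD880": 2.89, "DD670": 2.96, "DD860": 1.35, "DDTBD": 0.675}


def implied_idle_watts(model: str) -> float:
    """Idle power implied by the normalized value: (W/TB) * TB."""
    return WATTS_PER_TB[model] * CAPACITY_TB[model]


def normalized_watts_per_tb(idle_watts: float, capacity_tb: float) -> float:
    """Normalization used in the text: idle power / maximum usable capacity."""
    return idle_watts / capacity_tb


if __name__ == "__main__":
    for model in CAPACITY_TB:
        print(f"{model}: ~{implied_idle_watts(model):.0f} W idle (derived)")
```

The derived figures agree with the ratios stated earlier: roughly 778W versus 225W (about 3.5×) for DDTBD and DD670, and roughly 555W versus 259W (about 2.1×) for DD880 and DD860.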


Observation 2: Whereas idle controller power consumption
varies between models, normalized watts per
byte goes down with newer generations.





Controller under load. We measured the power con- 
sumption of each controller while running the aforemen- 
tioned workloads. Each controller ran the DDFS dedup- 
licating file system [35] and all required software ser- 
vices. Services such as replication were disabled. The 
power consumed under load approximates the power typically
seen for controllers in use in a data center. The
workloads used are performance-qualification tests that 
are designed to mimic real customer workloads, but do 
not guarantee that the controllers are stressed maximally. 


Figure 2(a) shows the power consumed by DDTBD 
while running the WL-A workload. The maximum power 
consumed during the run was 937W, which is 20% higher 
than the idle power consumption. Since the power only 
increased 20% when under load, it may be more bene- 
ficial to improve idle consumption before trying to im- 
prove active (under load) consumption. 


DD880 | DD670 | DD860 | DDTBD


Table 5: Power increase ratios from idle to loaded system 





Table 5 shows the percentage increase in power from idle
to loaded across controller and workload combinations.
Several combinations have an increase of less than 30%, 
while others exceed 50%. Unfortunately, our method- 
ology did not allow us to identify which internal compo- 
nents caused the increase. One noticeable trend is that the 
increase in power is mostly due to the controller model 
rather than the workload, as DD880 and DD860 always 
increased more than DD670 and DDTBD. 


Figure 2: Power consumption and I/O statistics for WL-A on DDTBD, along with the 5 ES30 enclosures attached to it 


Observation 3: The increase in controller power

I/O statistics from the disk sub-system help explain the 
increases in controller power consumption. Figure 2(b) 
shows the number of blocks per second read and written 
to the enclosures attached to DDTBD during WL-A. We 
see that a higher rate of disk I/O activity generally cor- 
responds to higher power consumption in both the con- 
troller and disk enclosures. Whereas I/Os require the con- 
troller to wait on the disk sub-system, they also increase 
memory copying activity, communication with the sub- 
system, and deduplication fingerprint hashing. 


Power-managed controller. Our backup systems per- 
form in-line, chunk-based deduplication, requiring sig- 
nificant CPU and RAM amounts to compute and manage 
hashes. As the data path is highly CPU-intensive, apply- 
ing DVFS techniques during backup—a common way to 
manage CPU power consumption—can degrade perfor- 
mance. Although it is difficult to throttle CPU during a 
backup, the backup processes are usually separated by 
large idle periods, which provide an opportunity to exploit
DVFS and other power-saving techniques.


Intel has introduced a small set of CPU power-saving 
states, which represent a range of CPU states from fully 
active to mostly powered-off. For example, on the 
Core i7, C1 uses clock gating to reduce processor activity,
C3 powers down L2 caches, and C6 shuts off the
core’s power supply entirely [28]. To evaluate the efficacy
of the Intel C states on an idle controller, we measured
the power savings of the deepest C state. Unfortunately,
DDTBD was the only model that supported the
Intel C states. We used a modified version of CPUIDLE 
to place DDTBD into the C6 state [16]. In this state, 
DDTBD saved just 60W, a mere 8% of total controller 
power consumption. This finding suggests that DVFS 
alone is insufficient for saving power in controllers with 
today’s CPUs and a great deal of RAM. Moreover, deeper 
C states incur higher latency penalties and slow controller 
performance. We found that the latencies made the con- 
troller virtually unusable when in the deepest C state. 


Observation 4: Placing today’s Intel CPUs into a
deep C state saves only a small amount of power and
significantly harms controller performance.





4.2 Enclosure Measurements 


We now analyze the power consumption of two genera- 
tions of disk enclosures. Similar to Section 4.1, we an- 
alyzed the power consumption of the enclosures when 
idle, under load, and using power-saving techniques. 


Enclosure idle power. An enclosure is idle when it 
is powered on and has no workload running. The idle 
power consumption of an enclosure represents the lowest 
amount of power a single enclosure and the housed disks 
consume without power-management support. Figure 3 
shows that an idle ES20 consumes 278W. The number of 
active enclosures in a high-capacity system can exceed 
50, so the total power consumption of the disk enclosures 
alone can exceed 13kW. 

Figure 3: Disk power down vs. spin down. ES20 and ES30 are
specified as in Table 2.

We found that the enclosures have very different power
profiles. The idle ES20 consumes 278W, which is 55%
higher than the idle ES30, at 179W. We believe that
newer hardware largely accounts for this difference. For 
example, it is well known that power supplies are not 
100% efficient. Modern power supplies often place guar- 
antees on efficiency. One standard [1] provides an 80% 
efficiency guarantee, which means the efficiency will 
never go below 80% (e.g., for every 10W drawn from the 
wall, at least 8W is usable by components attached to the 
power supply). The ES30 has newly designed power sup- 
plies, temperature-based fan speeds, and a newer internal 
controller, which contribute to this difference. 


Observation 5: The idle power consumption varies
greatly across enclosures, with newer ones being more
power efficient.





Enclosure under load. We also measured the power 
consumption of each enclosure under the workloads dis- 
cussed in Section 3. We considered an enclosure under 
load when it was actively handling an I/O workload. 

As shown in Figure 2(a), the total power consumption 
of the five ES30 enclosures connected to DDTBD, pro- 
cessing WL-A, increased by 10% from 900W when idle 
to about 1kW. Not surprisingly, Figure 2(b) shows that an 
increase in enclosure power correlates with an increase in 
I/O traffic. The increases for the other enclosure and workload
combinations ranged from 6% to 22%.

Our deduplicating file system greatly reduces the 
amount of I/O traffic seen by the disk sub-system. As 
described in Section 3, we used an internal tool to mea- 
sure the power consumption of a fully utilized disk sub- 
system. Table 6 shows that ES20 consumption grew by 
22% from 278W when idle to 340W. ES30 increased 
15% from 179W idle to 205W. Interestingly, these in- 
creases are much smaller than those observed for the con- 
trollers under load in Section 4.1. 


Observation 6: The power consumption of the enclosures
increases between 15% and 22% under heavy load.

Power-managed enclosure. We compared the power
consumption of ES20 and ES30 using two disk power-saving
techniques: power-down and spin-down. With
spin-down, the disk is powered on, but the head is parked
and the motor is stopped. With power-down, the enclosure’s
disk slot is powered off, cutting off all drive power.


               | ES20 | ES30
Idle Power (W) | 278  | 179
Max Power (W)  | 340  | 205

Table 6: Max power for enclosures ES20 and ES30

As shown in Figure 3, the relative power savings of 
the ES20 and ES30 are quite different. For ES30, spin- 
down reduced power consumption by 55% from 179W 
to 80W. For ES20, the power dropped by 37% from 
278W to 176W. Although the absolute spin-down savings 
was roughly 100W for both enclosures, power-down was 
much more effective for ES30 than ES20. Power-down 
for ES30 reduced power consumption by 78%, but only 
44% for ES20. As mentioned in Section 3, each disk con- 
sumes less than 1W when spun-down. However, for both 
ES20 and ES30, power-down saved more than 1W per
disk compared to spin-down. 
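
The savings quoted above can be re-derived with a few lines of arithmetic. The Python sketch below is illustrative only: the power-down wattages are computed from the quoted percentage reductions (so they are approximate), and the disks-per-enclosure counts (16 for ES20, 15 for ES30) are an assumption taken from the 512-disk and 480-disk totals for 32 enclosures given in Section 4.3.

```python
# Figures quoted in the text; power-down watts are derived from the
# quoted percentage reductions, so they are approximate.
ENCLOSURES = {
    # name: (idle W, spin-down W, power-down reduction, disks per enclosure)
    "ES20": (278.0, 176.0, 0.44, 16),
    "ES30": (179.0, 80.0, 0.78, 15),
}


def savings(name: str) -> dict:
    """Spin-down and power-down savings for one enclosure model."""
    idle, spun_down, pd_reduction, disks = ENCLOSURES[name]
    powered_down = idle * (1.0 - pd_reduction)
    return {
        "spin_down_saved_w": idle - spun_down,
        "spin_down_pct": 100.0 * (idle - spun_down) / idle,
        "power_down_w": powered_down,
        # Extra power saved per disk by power-down relative to spin-down.
        "extra_w_per_disk": (spun_down - powered_down) / disks,
    }


if __name__ == "__main__":
    for name in ENCLOSURES:
        print(name, {k: round(v, 1) for k, v in savings(name).items()})
```

With these inputs, spin-down saves about 100W on either enclosure (37% for ES20, 55% for ES30), and power-down saves more than 1W per disk beyond spin-down in both cases.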





Observation 7: Disk power-down may be more effective
than disk spin-down for both ES20 and ES30.


Looking closer at the ES20 power savings, the enclo- 
sure actually consumes more power than the disks it is 
housing (an improvement opportunity for enclosure manufacturers).
With all disks powered down, ES20 consumes
155W, which is more than the 123W saved by powering 
down the disks (consistent with disk vendor specs). 


Observation 8: Disk enclosures may consume more
power than the drives they house. As a result, effective
power management of the storage subsystem may
require more than just disk-based power-management.





We observed that an idle ES30 enclosure consumes 
64% of the power of an idle ES20, while an ES30 in power-down mode
consumes only 25% of the power of an ES20 in power- 
down mode. This suggests that newer hardware’s idle 
and especially power-managed modes are getting better. 


4.3 System-Level Measurements


A common metric for evaluating a power management 
technique is the percentage of total system power that is 
saved. We measured the amount of power savings for dif- 
ferent controller and enclosure combinations using spin- 
down and power-down techniques. We considered sys- 
tem configurations with an idle controller and 32 idle en- 
closures (which totals 512 disks for ES20 and 480 disks 
for ES30) and we varied the number of enclosures that 
have all their disks power managed. We excluded DD670 
because it supports only up to 4 external shelves. 

Figure 4: Total system power savings using disk power management

Figure 4 shows the percentage of total system power
saved as the number of enclosures with power-managed
disks was increased. In Figure 4(a) disks were spun
down, while in Figure 4(b) disks were powered down.
We found that it took a considerable number of power-managed
disks to yield significant system power savings.
In the best case with DD860 and ES30, 13 of the 32
enclosures must have their disks spun down to achieve
a 20% power savings. In other words, over 40% of the
disks must be spun down to save 20% of the total power.
In the worst case with DDTBD and ES20, 19 of the 32
enclosures must have their disks spun down to achieve a
20% savings. This scenario required almost 60% of the
disks to be spun down to save 20% of the power. Only
two of our six configurations were able to achieve more
than 50% savings even when all disks were spun down.
These numbers improved when power-down was used,
but a large number of disks was still needed to achieve




The limited power savings is due in part to the con- 
trollers consuming a large amount of power. As seen in 
Section 4.1, a single controller may consume as much
power as 100 disks. Additionally, as shown in Section 4.2,
disk enclosures can consume more power than
all of the drives they house, and the number of enclosures 
must scale with the number of drives in the system. These 
observations indicate that for some systems, even aggres- 
sive disk power management may be insufficient to save 
enough power and that power must be saved elsewhere in 
the system (e.g., reducing controller and enclosure power 
consumption, new electronics, etc.). 
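
The system-level numbers in this section can be approximated with a simple linear model: total idle power is one controller plus 32 enclosures, and each power-managed enclosure saves the difference between its idle and spun-down power. The Python sketch below is a back-of-the-envelope model under stated assumptions, not the measurement procedure behind Figure 4: the controller idle wattages are derived from the normalized W/TB values in Section 4.1, and all names are illustrative.

```python
NUM_ENCLOSURES = 32

# Derived estimates: normalized idle power (W/TB) times capacity (TB).
CONTROLLER_IDLE_W = {
    "DD880": 2.89 * 192,
    "DD860": 1.35 * 192,
    "DDTBD": 0.675 * 1152,
}

# Idle and all-disks-spun-down power per enclosure, as quoted in Section 4.2.
ENCLOSURE_W = {
    "ES20": {"idle": 278.0, "spin_down": 176.0},
    "ES30": {"idle": 179.0, "spin_down": 80.0},
}


def savings_fraction(controller: str, enclosure: str, managed: int) -> float:
    """Fraction of idle system power saved with `managed` enclosures spun down."""
    enc = ENCLOSURE_W[enclosure]
    total = CONTROLLER_IDLE_W[controller] + NUM_ENCLOSURES * enc["idle"]
    saved = managed * (enc["idle"] - enc["spin_down"])
    return saved / total


def enclosures_needed(controller: str, enclosure: str, target: float):
    """Smallest number of spun-down enclosures reaching `target`, or None."""
    for managed in range(NUM_ENCLOSURES + 1):
        if savings_fraction(controller, enclosure, managed) >= target:
            return managed
    return None


if __name__ == "__main__":
    print(enclosures_needed("DD860", "ES30", 0.20))
    print(enclosures_needed("DDTBD", "ES20", 0.20))
```

Under these assumptions the model yields 13 enclosures for DD860 with ES30 and 19 for DDTBD with ES20 to reach a 20% savings, consistent with the best and worst cases reported above.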


5 Conclusions 


We presented the first study of power consumption in 
real-world, large-scale, enterprise, disk-based backup 
storage systems. Although we investigated only a handful
of systems, we have already uncovered three interesting
observations that may impact the design of future power- 
efficient backup storage systems. 

(1) We found that components other than disks con- 
sume a significant amount of power, even at large scales. 
We observed that both storage controllers and enclosures 
can consume large amounts of power. For example, 
DDTBD consumes more power than 100 2TB drives and 
ES20 consumes more power than the drives it houses. As 
a result, future power-efficient designs should look beyond
disks to target controllers and enclosures as well.

(2) We found a large difference between idle and ac- 
tive power consumption across models. For some mod- 
els, active power consumption is only 20% higher than 
idle, while it is up to 60% higher for others. This observation
indicates that existing systems are not achieving
energy proportionality [2, 4, 12, 29, 30], which states
that systems should consume power proportional to the 
amount of work performed. For some systems, we found 
a disproportionate amount of power used while idle. As 
backups often run on particular schedules, these systems 
may spend a lot of time idle, opening up opportunities to 
further reduce power consumption. 

(3) We discovered large power consumption differ- 
ences between similar hardware. Despite having simi- 
lar hardware specifications, we observed that the older 
DD880 model consumed twice as much idle power as 
the newer DD860 model. We also saw that an idle ES20 
consumed 55% more power than an idle ES30. This suggests
that the power profile of an existing system can be
improved by replacing old hardware with newer, more efficient
hardware. We hope to see continuing improvements
from manufacturers of electronics and computer parts. 


Future work. To evaluate the steady-state power profile
of a backup storage system, we plan to measure a
system that has been aged and a system with active background
tasks. For comparison, we would like to study
power use of primary storage systems and clustered storage
systems, whose hardware and workloads are different
from those of backup systems. Lastly, we would like to investigate
the contribution of individual computer components (e.g.,
CPUs and RAM) to overall power consumption.
Abstract


File system bugs that corrupt file system metadata on disk 
are insidious. Existing file-system reliability methods, 
such as checksums, redundancy, or transactional updates, 
merely ensure that the corruption is reliably preserved. 
The typical workarounds, based on using backups or re- 
pairing the file system, are painfully slow. Worse, the re- 
covery is performed long after the original error occurred 
and thus may result in further corruption and data loss. 

We present a system called Recon that protects file sys- 
tem metadata from buggy file system operations. Our ap- 
proach leverages modern file systems that provide crash 
consistency using transactional updates. We define declar- 
ative statements called consistency invariants for a file 
system. These invariants must be satisfied by each trans- 
action being committed to disk to preserve file system in- 
tegrity. Recon checks these invariants at commit, thereby 
minimizing the damage caused by buggy file systems. 

The major challenges to this approach are specifying 
invariants and interpreting file system behavior correctly 
without relying on the file system code. Recon provides 
a framework for file-system specific metadata interpreta- 
tion and invariant checking. We show the feasibility of 
interpreting metadata and writing consistency invariants 
for the Linux ext3 file system using this framework. Re- 
con can detect random as well as targeted file-system cor- 
ruption at runtime as effectively as the offline e2fsck file- 
system checker, with low overhead. 


1 Introduction 


It is no surprise that file systems have bugs [20, 29, 31]. 
Modern file systems are designed to support a range of en- 
vironments, from smart phones to high-end servers, while 
delivering high performance. Further, they must handle a 
large number of failure conditions while preserving data 
integrity. Ironically, the resulting complexity leads to bugs 
that can be hard to detect even under heavy testing. These 
bugs can cause silent data corruption [20, 19], random ap- 
plication crashes, or even worse, security exploits [30]. 
Unlike hardware errors and crash failures, it is much 
harder to recover from data corruption caused by bugs 
in file-system code. Hardware errors can be handled 
by using checksums and redundancy for error detection 
and recovery [4, 10]. Crash failure recovery can be 
performed using transactional methods, such as journaling [12],
shadow paging [14], and soft updates [9]. Modern
file systems, such as ZFS, are carefully designed to
handle a wide range of disk faults [32]. However, the ma- 
chinery used for protecting against disk corruption (e.g., 
checksums, replication and transactional updates) does 
not help if the file system code itself is the source of an 
error, in which case these mechanisms only serve to faith- 
fully preserve the incorrect state. 


File system bugs that cause severe metadata corrup- 
tion appear regularly. We compiled a list of bugs in the 
Linux ext3 and the recently deployed btrfs file systems, 
by searching for “ext3 corruption” and “btrfs corruption” 
in various distribution-specific bug trackers or mailing 
lists. Based on the bug description and discussions, we 
removed bugs that did not cause metadata inconsistency, 
or were not reproducible, or were reported by a single user 
only. Table 1 summarizes the remaining bugs. Note that
ext3, despite its maturity and widespread use, shows con- 
tinuing reports of corruption bugs. One recent example is 
not yet closed, while another was closed only in 2010 and affected
the ext2, ext3 and ext4 file systems. These reports
likely under-represent the problem because the bugs that
cause metadata corruption may be fail-silent, i.e., the error
is not reported at the time of the original corruption. By 
the time the inconsistencies appear, the damage may have 
escalated, making it harder to pinpoint the problem. 


When metadata corruption is discovered, it requires 
complex recovery procedures. Current solutions fall in 
two categories, both of which are unsatisfactory. One 
approach is to use disaster recovery methods, such as 
a backup or a snapshot, but these can cause significant 
downtime and loss of recent data. Another option is to 
use an offline consistency check tool (e.g., e2fsck) for 
restoring file system consistency. While a consistency 
check can detect most failures, it requires the entire disk 
to be checked, causing significant downtime for large file 
systems. This problem is getting worse because disk ca- 
pacities are growing faster than disk bandwidth and seek 
time [13]. Furthermore, the consistency check is run after 
the fact, often after a system crash occurs or even less fre- 
quently with journaling file systems. Thus an error may 
propagate and cause significant damage, making repair a 
non-trivial process [11]. For example, Section 5 shows 
that a single byte corruption may cause repair to fail. 


To minimize the need for offline recovery methods, our 
aim is to verify file-system metadata consistency at runtime.
Metadata is more vulnerable to corruption by file
system bugs because the file system directly manipulates
the contents of metadata blocks. Metadata corruption may
also result in significant loss of user data because a file
system operating on incorrect metadata may overwrite existing
data or render it inaccessible.


ext3 corruption fix
Linux: Data corrupting ext3 bug in 2.4.20
panic/ext3 fs corruption with RHEL4-U6-re20070927.0
Re: [2.6.27] filesystem (ext3) corruption (access beyond end)
linux-2.6: ext3 filesystem corruption
linux-image-2.6.29-2-amd64: occasional ext3 filesystem corruption
ENOSPC during fsstress leads to filesystem corruption on ext2, ext3, and ext4

Closed: 2002-06, 2002-12, 2007-11, 2008-06, 2008-09, 2009-06, 2010-03

https://lkml.org/lkml/2011/6/16/99 | ext3: Fix fs corruption when make_indexed_dir() fails | 2011-06
Redhat, #658391 | Data corruption: resume from hibernate always ends up with EXT3 fs errors

https://lkml.org/lkml/2009/8/21/45 | btrfs rb corruption fix | 2009-08
[2.6.33 regression] btrfs mount causes memory corruption

Closed: 2010-02, 2010-09, 2011-02, 2011-04

Table 1: File system bugs causing data corruption. All Red Hat and Debian bugs are rated high-severity. The severity
level of bugs obtained from mailing lists is not known.

We present a system called Recon that aims to pre- 
serve metadata consistency in the face of arbitrary file- 
system bugs. Our approach leverages modern file systems 
that provide crash consistency using transactional meth- 
ods, such as journaling [28, 6, 27] and shadow paging file 
systems [14, 4, 16]. Recon checks that each transaction 
being committed to disk preserves metadata consistency. 
We derive the checks, which we call consistency invari- 
ants, from the consistency rules used by the offline file 
system checker. A key challenge is to correctly interpret 
file system behavior without relying on the file system 
code. Recon provides a block-layer framework for inter- 
preting file system metadata and invariant checking. 

An important benefit of Recon is its ability to convert 
fail-silent errors into detectable invariant violations, rais- 
ing the possibility of combining Recon with file system 
recovery techniques such as Membrane [26], which are 
unable to handle silent failures. 

Our current implementation of Recon shows the feasi- 
bility of interpreting metadata and writing consistency in- 
variants for the widely used Linux ext3 file system. Recon 
checks ext3 invariants corresponding to most of the con- 
sistency properties checked by the e2fsck offline check 
program. It detects random and type-specific file-system 
corruption as effectively as e2fsck, with low memory and 
performance overhead. At the same time, our approach 
does not suffer from the limitations of offline checking 
described earlier because corruption is detected immedi- 
ately. The rest of the paper describes our approach in de- 
tail and presents the results of our initial evaluation. 
2 Approach


The Recon system interposes between the file system and 
the storage device at the block layer and checks a set of 
consistency invariants before permitting metadata writes 
to reach the disk. We derive the invariants from the rules 
used by the file system checker. As an example, the e2fsck 
program checks that file system blocks are not doubly al- 
located. Our invariants check this property at runtime and 
thus prevent file-system bugs from causing any double al- 
location corruption on disk. 

Figure 1 shows the architecture of the Recon system.
Recon provides a framework for caching metadata blocks 
and an API for checking file-system specific invariants us- 
ing its metadata cache. A separate cache is maintained 
because the file system cache is untrusted and because it 
allows checking the invariants efficiently. Besides ext3, 
we have also examined the consistency properties of the 
Linux btrfs file system and implemented several btrfs in- 
variants. The paper describes our initial experience with 
adapting our system for btrfs. 

Our approach addresses three challenges: 1) when 
should the consistency properties be checked, 2) what 
properties should be checked, and 3) how should they be 
checked. Below, we describe these challenges and how 
we address them. The caching framework and the file- 
system specific Recon APIs are described in Section 4. 


2.1 When to Check Consistency? 


The in-memory copies of metadata may be temporarily
inconsistent during file system operation and so it is not
easy to check consistency properties at arbitrary times. Instead,
checks can be performed when the file system itself
claims that metadata is consistent. For example, journaling
and shadow-paging file systems are already designed
to ensure crash consistency using transactional methods,
wherein disk blocks from one or more operations, such as
the creation of a directory and a file write, are grouped into
transactions. Transaction commits are well-defined points
at which the file system believes that it is consistent, and
hence transaction boundaries serve as convenient vantage
points for verifying consistency properties. Recon checks
transactions before they commit, thereby ensuring that a
committed transaction is consistent, even in the face of
arbitrary file system bugs.

Figure 1: The Recon Architecture

Checking consistency for shadow paging systems is 
relatively straightforward because all transaction data is 
written to disk before the commit block. For example, 
btrfs writes all blocks in a transaction, and then commits 
the transaction by writing its superblock. Recon checks 
each transaction before the superblock is written to disk. 

Checking consistency for journaling file systems is 
more complicated because transaction data is written to 
disk both before and after the commit block. For ex- 
ample, ext3 writes metadata to disk in several steps: 1) 
write metadata to journal, 2) write commit block to jour- 
nal, at which point the transaction is committed, 3) write 
(or checkpoint) metadata to its final destination on disk, 
and 4) free space in the journal. 

During Step 1, Recon copies metadata blocks into its
write cache, giving it a view of all the updates in a transaction.
Then it checks the ext3 transaction in Step 2, i.e.,
before the commit block is written to the journal, which
ensures that all blocks in the transaction are checked for
consistency before they become durable. Checking consistency
after commit could lead to checkpointing a corrupt
block, and furthermore it would not be possible to
undo such corruption. Besides checking consistency at
commit, we also need to verify the checkpointing process.
This step requires checking that all the committed blocks
and their contents are checkpointed correctly.

Footnote: Implementing consistency invariants for soft update file systems [9]
that provide consistency after each write but do not use transactions
should be possible but will likely be more complicated.
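
The commit-time checking described above can be pictured as a small shim sitting between the file system and the disk. The Python sketch below is a toy model of that flow, not Recon's implementation: the class, its method names, the placeholder invariant, and the policy of dropping a failing transaction are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Transaction = Dict[int, bytes]            # block number -> new contents
Invariant = Callable[[Transaction], bool]


class CommitInterposer:
    """Toy block-layer shim that checks a journaled transaction at commit."""

    def __init__(self, invariants: List[Invariant]) -> None:
        self.invariants = invariants
        self.write_cache: Transaction = {}   # updates seen during Step 1
        self.committed: Transaction = {}     # committed, awaiting checkpoint

    def on_journal_write(self, block_no: int, data: bytes) -> None:
        """Step 1: copy each journaled metadata block into the write cache."""
        self.write_cache[block_no] = data

    def on_commit_block(self) -> bool:
        """Step 2: let the commit block through only if all invariants hold."""
        ok = all(check(self.write_cache) for check in self.invariants)
        if ok:
            self.committed.update(self.write_cache)
        # Assumption: a failing transaction is simply dropped in this sketch.
        self.write_cache = {}
        return ok

    def on_checkpoint_write(self, block_no: int, data: bytes) -> bool:
        """Step 3: a checkpointed block must match its committed contents."""
        return self.committed.get(block_no) == data


def no_empty_blocks(txn: Transaction) -> bool:
    """Placeholder invariant for the demo; real invariants are file-system specific."""
    return all(len(data) > 0 for data in txn.values())


if __name__ == "__main__":
    shim = CommitInterposer([no_empty_blocks])
    shim.on_journal_write(100, b"inode-block")
    print(shim.on_commit_block())
    print(shim.on_checkpoint_write(100, b"inode-block"))
    print(shim.on_checkpoint_write(100, b"garbage"))
```

The key design point mirrored here is ordering: the invariants run before the commit block is allowed through, so nothing unchecked becomes durable, and checkpoint writes are compared against what was actually committed.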


2.2 What Consistency Properties to Check? 


Identifying the correct consistency properties is challeng- 
ing because the behavior of the file system is not for- 
mally specified. Fortunately, we can derive an informal 
specification of metadata consistency properties from of- 
fline file-system consistency checkers, such as the Linux 
e2fsck program. For example, Gunawi et al. found that 
the Linux e2fsck program checks 121 properties that are 
common to both ext2 and ext3 file systems and some ext3 
journal properties and optional features [11].

These consistency properties define what it means to 
have consistent metadata on disk. Our aim is to ensure 
that any metadata committed to disk will maintain these 
same consistency properties. Unfortunately, consistency 
properties are global statements about the file system. For 
example, a simple check implemented by e2fsck is that 
the deletion times of all used inodes are zero. Determining
the in-use status of all inodes, and checking the deletion
time of all used inodes is infeasible at every transaction
commit. Similarly, another consistency property is
that all live data blocks are marked in the block bitmap. 
Checking these global properties requires a full disk scan. 

Instead, we derive a consistency invariant from each 
consistency property. The invariant is a local assertion 
that must hold for a transaction to preserve the corre- 
sponding file system consistency property. For example, 
consider the “all live data blocks are marked in the block 
bitmap” property. The corresponding consistency invari- 
ant is that a transaction that makes a data block live (i.e., 
by adding a pointer to the block) must also contain a corre- 
sponding bit-flip (from 0 to 1) in the block bitmap within 
the same transaction, i.e., the invariant is “block pointer 
set from 0 to N ⇔ bit N set in bitmap”. This invariant can
be checked by examining only the updated blocks, i.e., the
updated pointer block and the updated block bitmap must 
be part of the same transaction. We describe this invariant 
in more detail in Section 3.2. 
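
As a rough illustration of how local such a check is, the Python sketch below tests the block-allocation invariant using only the changes carried by one transaction. The representation of a transaction (a list of old/new pointer values and a map of old/new bitmap bits) and all names are hypothetical, and the sketch ignores deallocation and double allocation, which would need their own invariants.

```python
from typing import Dict, Iterable, Set, Tuple

PointerUpdate = Tuple[int, int]             # (old value, new value) of a block pointer
BitmapUpdates = Dict[int, Tuple[int, int]]  # bit number -> (old bit, new bit)


def newly_live_blocks(pointer_updates: Iterable[PointerUpdate]) -> Set[int]:
    """Blocks made live: a pointer changed from 0 to some block number N."""
    return {new for old, new in pointer_updates if old == 0 and new != 0}


def newly_marked_bits(bitmap_updates: BitmapUpdates) -> Set[int]:
    """Bits flipped from 0 to 1 in the block bitmap."""
    return {bit for bit, (old, new) in bitmap_updates.items()
            if old == 0 and new == 1}


def check_allocation_invariant(pointer_updates: Iterable[PointerUpdate],
                               bitmap_updates: BitmapUpdates) -> bool:
    """Pointer set from 0 to N if and only if bit N set, within one transaction."""
    return newly_live_blocks(pointer_updates) == newly_marked_bits(bitmap_updates)


if __name__ == "__main__":
    print(check_allocation_invariant([(0, 42)], {42: (0, 1)}))  # consistent update
    print(check_allocation_invariant([(0, 42)], {}))            # bit flip missing
```

Comparing the two sets checks both directions at once: a pointer without its bit flip and a bit flip without a pointer are both reported as violations.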

We structure a consistency invariant as an implication,
A ⇒ B. The premise A always involves an update to some
data structure field, and hence checking the invariant is
triggered by a change in that field. When such an update
occurs then the conclusion B must be true to preserve the
invariant. If a converse B ⇒ A invariant also exists, then
we refer to the two invariants as a biconditional invariant
A ⇔ B, as shown in the example above.

We rely on the ability to convert consistency proper- 
ties requiring global information into invariants that can 
be checked using information “local” to the transaction, 
as described in the previous example. Such a transformation
must be possible because file systems keep themselves
consistent without examining the entire disk state.
In other words, our invariant checking should not require 
much more data than the file system itself needs for its 
operations. Section 5 shows that this is indeed the case 
because Recon has low overheads. 

Finally, our invariant checking approach relies on an 
inductive argument. It assumes that the file system is con- 
sistent before each transaction. If the updates in the trans- 
action meet the consistency invariants, the file system will 
remain consistent after the transaction. Likewise, if an in- 
variant is violated, there is potential for data loss or in- 
correct data being returned to applications. Section 2.4 
provides more details about our assumptions. 


2.3 How to Check Consistency Invariants?


Consistency invariants are expressed in terms of logical 
file-system data structures, such as current and updated 
values of block pointers, bits in block bitmap, etc. However,
Recon needs to observe physical blocks below the
file system because it cannot trust a buggy file system to 
provide the correct logical data structure information. We 
bridge this semantic gap by inferring the types of metadata 
blocks when they are read or written, which allows pars- 
ing and interpreting them, similar to semantically smart 
disks [24]. Then Recon checks invariants on the typed 
blocks at commit points, as described below. 


Metadata interpretation. Block typing and metadata
interpretation depend on the idea that file systems access 
metadata by following a graph of pointers. For example, 
a pointer to a block is read before the pointed-to block is 
read, which we call the pointer-before-block assumption. 
These pointers may be explicit block pointers or may be implied 
by the structure of the file system. For example, ext3 
will read an inode containing a pointer to an indirect block 
before reading the indirect block. When an inode block is 
read, Recon copies it into its read cache and then parses 
the inodes in the block to create a mapping from a block to 
its type for any metadata blocks pointed to by the inodes. 
In this case, Recon creates a block-type mapping associat- 
ing the “indirect block” type with the block pointed to by 
the EXT3_IND_BLOCK pointer in the inode. As a result, 
Recon recognizes an indirect block when it is read. 
Similarly, the block group descriptor (BGD) tables in 
ext3 describe the locations of inode blocks and inode and 
block allocation bitmaps. The BGD tables must be read 
before any of the blocks that they point to, allowing Re- 
con to create block-type mappings for inode and bitmap 
blocks. This block-type identification is bootstrapped us- 
ing the superblock, which exists at a known location. 
When a metadata block is newly allocated in a transac- 
tion, Recon does not yet know its type. In this case, there 


FAST '12: 10th USENIX Conference on File and Storage Technologies 


must exist an updated metadata block in the transaction 
with a known type that points to this unclassified block 
directly or indirectly, or else the newly allocated block 
would not be reachable in the file system. By following 
the path of pointers from the known metadata block to 
the newly allocated block, Recon can always create block- 
type mappings for newly allocated blocks. 

For example, suppose a block is allocated to an indirect 
block of a file. If the file already existed then its inode 
block must have been read and updated in the transac- 
tion. Since the inode block was read previously, Recon 
knows its type and can determine the type of the newly 
allocated indirect block. Similarly, if the file did not exist 
then its parent directory must have existed and been up- 
dated, which helps determine the types of the (possibly 
newly allocated) inode block and then the indirect block. 
Determining the types of newly allocated blocks may re- 
quire multiple passes over the blocks updated in the trans- 
action. At the end, all new metadata blocks must be typed 
or else the pointer-before-block assumption is violated. 
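
The multi-pass typing of newly allocated blocks can be sketched as a small fixed-point loop. This is a minimal illustration in Python rather than Recon's C implementation; the names type_new_blocks and toy_parse, and the dictionary-based block contents, are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# Hypothetical parser: (block number, block type, contents) -> (child block, child type) pairs.
Parser = Callable[[int, str, dict], Iterable[Tuple[int, str]]]


def type_new_blocks(block_types: Dict[int, str],
                    txn_blocks: Dict[int, dict],
                    parse: Parser) -> Set[int]:
    """Type the metadata blocks of one transaction by following pointers.

    block_types maps block number -> type and is extended in place.
    Returns the blocks that remain untyped once a pass makes no progress.
    """
    pending = set(txn_blocks)
    progress = True
    while pending and progress:              # each iteration is one pass
        progress = False
        for blk in sorted(pending):
            if blk not in block_types:
                continue                     # type unknown so far; retry in a later pass
            for child, child_type in parse(blk, block_types[blk], txn_blocks[blk]):
                block_types.setdefault(child, child_type)
            pending.discard(blk)
            progress = True
    return pending


def toy_parse(blk: int, typ: str, contents: dict):
    """Toy layout: an inode block may name one indirect block under the key 'ind'."""
    if typ == "inode" and "ind" in contents:
        yield contents["ind"], "indirect"


types = {900: "inode"}                       # learned when block 900 was read earlier
txn = {500: {}, 900: {"ind": 500}, 777: {}}  # nothing in the transaction points to 777
untyped = type_new_blocks(types, txn, toy_parse)
```

In this toy transaction, block 500 is typed only in the second pass, after the inode block 900 has been parsed. Block 777 stays untyped, which corresponds to the case where the pointer-before-block assumption is violated.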


Commit processing At commit, Recon uses the block- 
type mapping to determine the data structures in each of 
the (updated) transaction blocks, available in the Recon 
write cache. These data structures are compared with their 
previous versions, which are derived from the Recon read 
cache, at the granularity of data structure fields. Each field 
update generates a logical change record with the format 
[type, id, field, oldval, newval]. 

The type specifies a data structure (e.g., inode, directory 
block). The id is the unique identifier of a specific object 
of the given type (e.g. inode number). The (type, id) pair 
allows locating the specific data structure in the file sys- 
tem image. The field is a field in the structure (e.g. inode 
size field) or a key from a set (e.g. directory entry name). 
The oldval and newval are the old and new values of the 
corresponding field. These records are generated for ex- 
isting, newly allocated and deallocated metadata blocks. 
When an item is newly created or allocated, the oldval is 
∅ (a sentinel value). Similarly, when an item is destroyed 
or deallocated, the newval is ∅. 
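
As a rough illustration of the record format, the following Python sketch derives change records by comparing two versions of a structure field by field. The function name diff_fields and the use of dictionaries and None (for the ∅ sentinel) are assumptions for the example, not Recon's actual representation.

```python
from typing import Any, Dict, List, Optional, Tuple

EMPTY = None  # stands in for the ∅ sentinel (assumption: no real field holds None)

ChangeRecord = Tuple[str, Any, Any, Any, Any]  # [type, id, field, oldval, newval]


def diff_fields(typ: str, obj_id: Any,
                old: Optional[Dict[str, Any]],
                new: Optional[Dict[str, Any]]) -> List[ChangeRecord]:
    """Emit one change record per field whose value differs.

    Pass old=None for a newly created object and new=None for a destroyed one.
    """
    old = old or {}
    new = new or {}
    records = []
    for field in sorted(set(old) | set(new)):
        oldval = old.get(field, EMPTY)
        newval = new.get(field, EMPTY)
        if oldval != newval:
            records.append((typ, obj_id, field, oldval, newval))
    return records


before = {"i_size": 4010, "i_blocks": 8, "uid": 0}
after = {"i_size": 7249, "i_blocks": 16, "uid": 0}
inode_records = diff_fields("Inode", 12, before, after)
created_records = diff_fields("Inode", 13, None, {"i_size": 0})
```

Unchanged fields such as uid produce no record, so invariant checks only see what the transaction actually modified.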

Figure 2 shows an example of a set of change records 
associated with an ext3 transaction in which a single 
write operation increases the size of a file from one block 
to two blocks. Change records serve as an abstraction, 
cleanly separating the interpretation of physical metadata 
blocks from invariant checking on logical data structures. 
We show how invariants are implemented using change 
records in Section 3. When all invariants are checked suc- 
cessfully, the transaction is allowed to commit. 


2.4 Fault Model 


Our goal is to preserve file-system metadata consistency 
in the presence of arbitrary file-system bugs. We make 

[Inode, 12, block[1], 0, 22717] 
[BBM, 22717, 0, 0, 1] 
[BGD, 0, free_blocks, 1500, 1499] 
[Inode, 12, i_size, 4010, 7249] 
[Inode, 12, i_blocks, 8, 16] 
[Inode, 12, mtime, 1-18-12, 1-20-12] 
[Inode, 12, ctime, 1-16-12, 1-20-12] 


Figure 2: Change records when a block is added to a file 


three assumptions to provide this guarantee. First, we 
assume that the Recon code and its invariant checks are 
correct and immutable and the Recon metadata cache is 
protected. If these assumptions are incorrect, it is unlikely 
that an inconsistent transaction would pass our checks, be- 
cause the file-system bug and our corrupted check would 
need to be correlated. However, Recon may generate false 
alarms, indicating corruption even when a transaction is 
consistent. Such corruption is still an indication of a bug 
in the overall system. A hypervisor-based Recon imple- 
mentation would provide stronger isolation of the Recon 
code and data from the kernel, helping ensure metadata 
consistency in the face of arbitrary kernel bugs. 

Second, if the ext3 file system writes a metadata block 
before Recon knows its type then Recon will assume that 
a data block is being written and will allow the opera- 
tion. For example, a file system bug may corrupt the block 
number in a disk request structure and cause a misdirected 
write to a metadata block. Recon will not detect this er- 
ror because the write violates our pointer-before-block as- 
sumption, and ext3 does not provide any other way to 
identify the block being updated.² As future work, we 
plan to retrofit ext3 to allow such identification. Misdi- 
rected writes will not cause a problem with btrfs because 
its extents are self-identifying [2]. 

Finally, our inductive assumption about metadata con- 
sistency before each transaction (discussed in Section 2.2) 
requires correct functioning of the lower layers of the sys- 
tem, including the Linux block device layer and all hard- 
ware in the data path. It is possible to detect and recover 
from errors at these layers by using metadata checksums 
and redundancy. This functionality could be implemented 
at the block layer for the ext3 file system [10]. The btrfs 
file system already provides such functionality [16]. If 
these assumptions are not met, offline checking and repair 
should be used as a last resort. 


3 Consistency Invariants 


A file system checker verifies file system consistency by 
applying a comprehensive set of rules for detecting and 
optionally repairing inconsistencies. We are primarily 
interested in checking consistency properties and can reuse 
the rules associated with detecting, but not repairing, 
inconsistencies. We have applied our approach to the ext3 
and the btrfs file systems. Below, we provide an overview 
of the consistency rules for these file systems. 


²We did not observe this problem because our fault injector corrupts 
metadata blocks but does not cause misdirected writes (see Section 5.2). 


The SQCK system [11] encapsulates the 121 checks of 
the ext3 fsck program in a set of SQL queries. Although 
there is a close correspondence between SQCK queries 
and e2fsck checks, some SQCK queries combine multi- 
ple checks. Table 2 provides a breakdown of the number 
of rules checked by SQCK for different file-system data 
structures. We show 101 rules in Table 2, because the rest 
are used for repair. The simplest checks (lines starting 
with the word Within) examine individual structures (e.g., 
superblock fields, inode fields, and directory entries ap- 
pear valid). Some checks ensure that pointers lie within an 
expected range. More complicated checks (lines starting 
with the word Between) ensure that block pointers (across 
all files) do not point to the same data blocks, and direc- 
tories form a connected tree. 


We have done a similar classification of the rules 
checked by the btrfs checker, as shown in Table 3. Btrfs is 
an extent-based, B-tree file system that stores file-system 
metadata structures (e.g., inodes, directories, etc.) in B-tree 
leaves [16]. It uses a shadow-paging transaction 
model for updates and for supporting file-system snap- 
shots. Extent allocation information is maintained in an 
extent B-tree, which serves the same purpose as ext3 
block bitmaps. The roots for all the B-trees are maintained 
in a top-level B-tree called the root tree. Although the 
btrfs checker is still a work in progress (e.g., it performs 
no repair), currently it uses 30 rules for detecting inconsis- 
tencies. Of these, the first four rule sets are used to check 
the structure of the B-tree, while the rest deal with typical 
file-system objects such as inodes and directories. 


Next, we provide several examples that show how 
we transform the consistency properties for various data 
structures shown in Tables 2 and 3 into invariants. An 
invariant is implemented by pattern matching change 
records. When such a match occurs, some invariants accumulate 
bookkeeping information and then require some final 
processing at transaction commit. 

Figure 2 annotations, one per change record in order: 
  In inode 12, direct block ptr 1 is set to block 22717 
  Block 22717 is marked allocated in block bitmap 
  In block group 0, nr. of free blocks decreases by 1 
  i_size field increases from 4010 to 7249 bytes 
  i_blocks is the number of sectors used by file 
  timestamp change (mtime) 
  timestamp change (ctime) 


Row | Consistency property                      | Rules 
A   | [illegible]                               | [illegible] 
B   | Within block group descriptor (BGD)       | 3 
C   | [illegible]                               | [illegible] 
D   | Within single directory                   | [illegible] 
E   | Between inode and directory entries       | 3 
F   | Between inode and its block pointers      | 2 
G   | Between inode, inode bitmap, orphan list  | 3 
H   | Between block bitmap and block pointers   | [illegible] 
I   | Between block, inode bitmap, BGD table    | 3 
J   | Between directories                       | [illegible] 
K   | Bad blocks inode                          | [illegible] 


Table 2: Number of Ext3/SQCK rules by datatype 





3.1 Ext3 Immutable Fields, Range Checks 


The ext3 fsck program checks for valid values in several 
fields of the superblock and group descriptor table (rows 
A and B in Table 2). Many of these fields are initialized 
when a file system is created and should never be mod- 
ified by a running file system. Invariants on these fields 
are implemented by pattern matching a change record of 
the form [Superblock, _, immutable_field, _, _], where 
immutable_field is the name of the field that should not 
change, and _ matches any value. The existence of this 
record indicates that the field was modified, and signals a 
violation. Another similar class of consistency properties 
requires simple range checks on the values of given fields. 
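
A minimal Python sketch of this kind of pattern matching follows. The wildcard object, the helper names, and the particular list of immutable superblock fields are illustrative assumptions; the paper does not enumerate the fields it protects.

```python
from typing import Any, List, Sequence

ANY = object()  # wildcard, written "_" in the text

# Illustrative subset only (an assumption): ext3 superblock fields fixed at mkfs time.
IMMUTABLE_SB_FIELDS = ("s_log_block_size", "s_blocks_per_group", "s_inodes_per_group")


def matches(pattern: Sequence[Any], record: Sequence[Any]) -> bool:
    """True if every non-wildcard position of the pattern equals the record."""
    return len(pattern) == len(record) and all(
        p is ANY or p == r for p, r in zip(pattern, record))


def immutable_violations(records: List[tuple]) -> List[tuple]:
    """Any record matching [Superblock, _, immutable_field, _, _] is a violation."""
    patterns = [("Superblock", ANY, f, ANY, ANY) for f in IMMUTABLE_SB_FIELDS]
    return [rec for rec in records if any(matches(p, rec) for p in patterns)]


def out_of_range(record: tuple, lo: int, hi: int) -> bool:
    """Simple range check on the new value of a field."""
    return not (lo <= record[4] <= hi)


records = [
    ("Superblock", 0, "s_free_blocks_count", 100, 99),    # allowed to change
    ("Superblock", 0, "s_blocks_per_group", 32768, 8192),  # must never change
]
bad = immutable_violations(records)
```

Because the mere existence of a matching record signals the violation, this class of invariant needs no bookkeeping and no final processing at commit.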


3.2 Ext3 Block Bitmap and Block Pointers 


An important consistency property in ext3 is that no data 
block may be doubly allocated, i.e., every block pointer 
(whether it is found in a live inode or indirect block) must 
be unique or 0. Checking this property would be expen- 
sive if we simply scanned all inodes and indirect blocks 
searching for another instance of the pointer. 

The file system maintains this property without examin- 
ing the entire disk state by using block allocation bitmaps 
(row H in Table 2), with the resulting consistency property 
being that “all live data blocks are marked in the block 
bitmap”. The corresponding consistency invariant is that 
a transaction that makes a data block live (i.e., by adding 
a pointer to the block) must also contain a corresponding 
bit-flip (from 0 to 1) in the block bitmap within the same 
transaction, as shown below. 


block pointer set to N from 0  ⇔  bit N set in bitmap      (1) 
block pointer set to 0 from N  ⇔  bit N unset in bitmap    (2) 

These invariants involve relationships between different 
fields and require matching multiple change records. 
The left side of the first invariant is triggered by matching 
change records of the form [_, _, block_pointer_field, 
0, X], indicating a new pointer to block X. When such a 
match occurs, we insert a “new pointer” flag with key X 
into a rule-specific table. The right side of this (biconditional) 
invariant is triggered by matching [BBM, Y, _, 
0, 1] records, indicating bit Y in the allocation bitmap is 
newly set. When this match occurs, we insert a “bit set” 
flag with key Y into the same table. During final processing, 
the implementation verifies that for each key in the 
table, both flags are set. Otherwise the invariant has been 
violated. For example, in the simple transaction shown in 
Figure 2, there is exactly one record matching each of the 
left and right sides of Invariant 1 shown above, and the 
values of X and Y are both 22717. 


Row | Consistency property                        | Rules 
B   | Between parent and child tree blocks        | 3 
D   | Within an extent item in extent tree        | 2 
I   | Between inode, data extents, checksum tree  | 6 
[other rows illegible] 


Table 3: Number of Btrfs rules by datatype 

Invariants 1 and 2 ensure that when a block pointer is 
set, the corresponding bit in the bitmap is also set. However, 
we must also ensure that a pointer to the same block 
is set only once in a transaction, i.e., we must check for 
double allocation within a transaction. To do so, we simply 
count the number of times we see a block pointer set 
to a given block in the transaction: 

block pointer set to N  ⇒ 
    (count(block pointer == N) in transaction) == 1    (3) 

3.3 Ext3 Directories 


The inter-directory consistency properties essentially ensure 
that the directory tree forms a single, bidirected³ 
tree (row J in Table 2). This complex consistency property 
requires two biconditional and two regular invariants. 
Whenever a directory is linked (or its “..” entry changes), 
Invariant 4 checks that the directory’s parent (child) has 
the directory as its child (parent). This check also ensures 
that a directory does not have multiple parents. When a 
directory is unlinked (or moved), Invariant 5 checks that 
it is unlinked on both sides (although not shown, we also 
check that an unlinked directory is empty). When a directory’s 
“.” entry is updated, Invariant 6 checks that the “.” 
entry points to itself. 


[Dir, C, “..”, _, P]   ⇔  [Dir, P, nm, _, C] and (nm != “..”)    (4) 
[Dir, C, “..”, P, _]   ⇔  [Dir, P, nm, C, _] and (nm != “..”)    (5) 
[Dir, D1, “.”, _, D2]  ⇒  D1 == D2                               (6) 
[Dir, _, “..”, _, P]   ⇒  is_ancestor(ROOT, P)                   (7) 


³A bidirected tree is the directed graph obtained from an undirected 
tree by replacing each edge by two directed edges in opposite directions. 


Finally, Invariant 7 checks that a directory update does 
not cause cycles. Invariants 4 and 5 do not prohibit cy- 
cles. For example, suppose that the file system allows the 
command “mv /a /a/b” to complete successfully. This update 
would be allowed by Invariants 4 and 5, but it 
would create a disconnected cycle consisting of a and b. 
Invariant 7 checks for cycles when a directory’s parent en- 
try (the “..” entry) is updated. It ensures that the chain of 
parent directories eventually reaches the root directory, or 
a cycle is detected. The is_ancestor() primitive operates 
on the Recon metadata caches described in Section 4. 
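
One plausible way to realize the cycle check is to walk parent links until the root is reached or a directory repeats. The Python sketch below assumes a hypothetical parent_of dictionary standing in for the Recon metadata caches; the inode numbers are made up for the example.

```python
from typing import Dict


def is_ancestor(root: int, start: int, parent_of: Dict[int, int]) -> bool:
    """Follow ".." links from start; True only if the chain reaches root.

    Returns False when a directory repeats (a cycle) or a link is missing.
    """
    seen = set()
    node = start
    while node not in seen:
        if node == root:
            return True
        seen.add(node)
        node = parent_of.get(node)
        if node is None:
            return False          # dangling parent link
    return False                  # revisited a directory: cycle


ROOT, A, B = 2, 11, 12
tree = {ROOT: ROOT, A: ROOT, B: A}   # /a and /a/b before the move
after_mv = dict(tree)
after_mv[A] = B                      # buggy "mv /a /a/b": a's ".." now names b
```

After the buggy move, the change record for a's “..” entry has P = b, and the walk from b alternates between a and b without reaching the root, so Invariant 7 fails.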


3.4 Btrfs Inode and Directory Entries 


Metadata structures in btrfs are indexed by a 17-byte key 
consisting of the tuple (objectID, type, offset). ObjectID 
is roughly analogous to an inode number in ext3. The type 
field determines the type of the structure, and the meaning 
of “offset” depends on the type. Each key is unique within 
a btrfs tree, so the unique (type, id) pair for our change 
records consists of (type, (tree id, objectid, offset)). 

A btrfs consistency property is that the inode associ- 
ated with a directory item (that is, a btrfs directory entry) 
has a directory mode (row F in Table 3). An invariant 
derived from this property is that when we add a new di- 
rectory item, there must exist an appropriate inode item 
after transaction commit. We can represent this as: 


[DIR_ITEM, (T, I, _), _, ∅, _]  ⇒ 
    exists(T, I, INODE_ITEM, 0) and 
    ISDIR(get_item(T, I, INODE_ITEM, 0).mode) 


The left hand side matches a directory item within snap- 
shot tree T and objectid I that is being newly created. This 
invariant asserts that 1) there is a matching inode item, and 
2) its mode is of directory type. The exists() primitive re- 
turns true if the given item can be found in tree T, and the 
get_item primitive obtains the contents of the item, allow- 
ing us to check the mode. These primitives operate on the 
Recon metadata caches. 
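
The invariant above can be sketched in Python as follows. The cache is modeled as a dictionary keyed by (tree, objectid, type, offset), and None stands in for the ∅ sentinel; both are assumptions for illustration, as is the function name check_new_dir_item.

```python
import stat

INODE_ITEM = "INODE_ITEM"
EMPTY = None   # stands in for the ∅ sentinel


def exists(cache, tree, objectid, item_type, offset):
    """True if the item is present in the (hypothetical) metadata cache."""
    return (tree, objectid, item_type, offset) in cache


def get_item(cache, tree, objectid, item_type, offset):
    """Return the contents of an item from the cache."""
    return cache[(tree, objectid, item_type, offset)]


def check_new_dir_item(record, cache):
    """Check the invariant for one change record; True means no violation."""
    typ, key, _field, oldval, _newval = record
    if typ != "DIR_ITEM" or oldval is not EMPTY:
        return True                      # pattern does not match; nothing to check
    tree, objectid, _offset = key
    return (exists(cache, tree, objectid, INODE_ITEM, 0)
            and stat.S_ISDIR(get_item(cache, tree, objectid, INODE_ITEM, 0)["mode"]))


cache = {
    (5, 256, INODE_ITEM, 0): {"mode": stat.S_IFDIR | 0o755},
    (5, 257, INODE_ITEM, 0): {"mode": stat.S_IFREG | 0o644},
}
good = check_new_dir_item(("DIR_ITEM", (5, 256, 99), "name", EMPTY, "x"), cache)
```

The check runs against post-commit state, so in a real implementation the lookup would consult the updated blocks before falling back to previously verified ones.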


4 Implementation 


We use the Linux device mapper framework to interpose 
on all file system I/O requests at the block layer, as shown 
in Figure 1. On a metadata block read, recon_read caches 
the block in the Recon read cache. This cache allows ac- 
cessing the disk or the pre-update file-system metadata 
state efficiently. Its contents are trusted because its blocks 
have been verified previously. On a metadata block write, 
recon_write caches the updated block in the Recon write 
cache. The write cache may contain corrupt data and thus 
any code accessing this cache must perform careful vali- 
dation. Both caches also store block-specific information 


such as the block-type map. Similar to a file system buffer 
cache, neither Recon cache persists across reboots. 


4.1 Commit Process 


At commit, our framework requires that 1) all transac- 
tion blocks must have been recorded using recon_write, 
and 2) recon_commit is called before the commit block 
reaches the disk. We can record blocks and detect com- 
mit either from the transaction subsystem (transaction- 
layer commit) or at the block layer (block-layer commit). 
With transaction-layer commit, the file system’s transac- 
tion commit code is modified to invoke recon_write on 
the updated metadata blocks, and invoke recon_commit 
before writing the commit block. This method is simpler 
to implement, but it makes us dependent on the transac- 
tion layer code, such as JBD in ext3. In particular, it does 
not allow us to verify the ext3 checkpointing process. 
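
The interplay of the two caches and the commit hook can be summarized in a short Python sketch. The class name ReconSketch and the check callback are hypothetical; the merge-and-clear step on success follows the description in Section 4.2, and what happens on failure is left to the caller.

```python
from typing import Callable, Dict


class ReconSketch:
    """Toy model of the recon_read / recon_write / recon_commit flow."""

    def __init__(self, check: Callable[[Dict[int, bytes], Dict[int, bytes]], bool]):
        self.read_cache: Dict[int, bytes] = {}    # trusted, previously verified blocks
        self.write_cache: Dict[int, bytes] = {}   # untrusted blocks of the open transaction
        self.check = check                        # stands in for the invariant checks

    def recon_read(self, blk: int, data: bytes) -> None:
        self.read_cache[blk] = data

    def recon_write(self, blk: int, data: bytes) -> None:
        self.write_cache[blk] = data

    def recon_commit(self) -> bool:
        """Must be called before the commit block reaches the disk."""
        if not self.check(self.read_cache, self.write_cache):
            return False                          # violation: do not let the commit proceed
        self.read_cache.update(self.write_cache)  # verified blocks become the trusted view
        self.write_cache.clear()
        return True


recon = ReconSketch(lambda rc, wc: all(data != b"bad" for data in wc.values()))
recon.recon_read(1, b"old")
recon.recon_write(1, b"new")
first = recon.recon_commit()
recon.recon_write(2, b"bad")
second = recon.recon_commit()
```

With transaction-layer commit, the calls to recon_write and recon_commit would be made from the file system's own commit code; with block-layer commit, they would be driven by observed journal writes.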

With block-layer commit, recon_write could be in- 
voked on all block writes. The challenge is to separate 
metadata blocks from data blocks because we do not want 
to cache every data block. However, we can only identify 
newly allocated metadata blocks at commit, making them 
hard to distinguish from data on each write. Fortunately, 
for ext3, metadata blocks are written to the journal, and 
thus we can ignore blocks that are not journaled. This 
approach requires interpreting journal writes at the block 
layer, which also helps detect commit. While this im- 
plementation is more complicated, it removes any depen- 
dency on the journaling code. For btrfs, metadata writes 
can be easily distinguished because they are directed to 
designated regions on disk called btrfs chunks. Btrfs com- 
mits occur when the superblock is written, which is easy 
to detect because the superblock is in a known location. 

We have implemented both transaction-layer and 
block-layer commit, but currently we have only evaluated 
the transaction-layer commit implementation. 


4.2 Cache Pinning and Eviction 


We control the amount of memory used by the Re- 
con caches with a simple LRU mechanism for replacing 
blocks from the read cache when it grows beyond a user- 
configurable limit. All read cache blocks are pinned dur- 
ing recon_commit processing to simplify implementation. 
We expect that recon_commit will run quickly because 
the blocks needed for commit processing have likely been 
read by the file system recently and so they will not need 
to be read from disk to populate the read cache. We pin 
the Recon write cache for the duration of the transaction 
because we will need these blocks for checking invariants. 
This approach is similar to the ext3 file system pinning its 
journal blocks for performance. However, we could unpin 
a block once it reaches disk, e.g., the journal in ext3. 

FS Recon API  | Invoked on | Description 
references    | Read       | provides type and id information for data structures in referenced blocks 
process_write | Commit     | provides type and id information for newly allocated metadata blocks 
process_txn   | Commit     | generates change records 
txn_check     | Commit     | checks invariants using change records and metadata read/write caches 


Table 4: File-system specific Recon API 


After commit, the contents of the write cache are 
merged into the read cache, thus updating Recon’s view 
of file-system state, and the write cache is cleared. At this 
point, we can unpin the read cache because all the blocks 
in the cache are on disk (e.g., either in the journal or the 
checkpointed location in ext3). However, our transaction-layer 
commit implementation for ext3 does not track the 
location of blocks in the journal. To avoid evicting a block 
that may be in the journal, we keep a list of most recently 
updated blocks in the read cache. This list contains as 
many blocks as it takes to fill the journal and we pin these 
blocks. Once a block is evicted from this list, it must have 
been checkpointed, or else it would have been overwritten 
in the journal, and so we can unpin it. 
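
The journal-sized list of recently updated blocks might look like the following Python sketch. The class name and the choice to keep each block only once are assumptions; keeping distinct blocks pins a block at least as long as a list with one entry per journal write would, so it errs on the safe side.

```python
from collections import OrderedDict
from typing import Iterable, List


class RecentlyUpdated:
    """Pins the most recently updated blocks, up to the journal's capacity."""

    def __init__(self, journal_blocks: int):
        self.capacity = journal_blocks
        self._pinned: "OrderedDict[int, bool]" = OrderedDict()   # oldest entry first

    def record_commit(self, blocks: Iterable[int]) -> List[int]:
        """Add a committed transaction's blocks; return blocks that may be unpinned."""
        for blk in blocks:
            self._pinned.pop(blk, None)      # a re-updated block moves to the newest end
            self._pinned[blk] = True
        unpinned = []
        while len(self._pinned) > self.capacity:
            old, _ = self._pinned.popitem(last=False)
            unpinned.append(old)             # old enough to have been checkpointed
        return unpinned

    def is_pinned(self, blk: int) -> bool:
        return blk in self._pinned


recent = RecentlyUpdated(journal_blocks=3)
r1 = recent.record_commit([1, 2])
r2 = recent.record_commit([3, 1])    # order is now 2, 3, 1
r3 = recent.record_commit([4])       # block 2 falls off the list
```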


4.3 File-System Specific Processing 


Recon invokes file-system specific API functions for 
metadata interpretation and invariant checking, as shown 
in Table 4. The references function is invoked by 
recon_read to parse a metadata block and create block-type 
mappings for pointed-to blocks. This function is also used 
to distinguish between data and metadata on the read path. 

The rest of the functions in Table 4 are invoked by 
recon_commit. The process_write function is similar to 
the references function but invoked on all the blocks in 
the write cache (i.e., each updated or newly allocated 
metadata block). This function must validate the updated 
blocks by checking that any pointers, strings and size 
fields within the block have reasonable values so that fur- 
ther processing is not compromised. Recon ignores un- 
known blocks and only processes updated blocks whose 
types are known. As unknown blocks become known, 
they are added to the queue of blocks being processed. 
At the end of write processing, if any unknown blocks re- 
main, Recon signals a reachability invariant violation, as 
discussed in Section 2.3. 

Once the block and data types within blocks are known, 
the process_txn function compares updated data structures 
with their previous versions to derive a set of change 
records. The previous version of a data structure is 
uniquely determined by the (type, id) pair of the change 
record. In ext3, the type is determined by block type and 
the id is typically an inode number or a block number. In 
btrfs, the type and id are determined by the tree and the 
key, as discussed in Section 3.4. 

While the process of comparing data structures is 
clearly file-system specific, we found two common cases. 
When data structures have fixed size, such as inodes in 
ext3 and most items in btrfs, we use a simple byte-level 
diff that is driven by tables that describe the layout of 
the data structures. These tables are generated from the 
data structures using C macros. When data structures 
themselves contain sets of smaller items, such as direc- 
tory entries in ext3, or extent items in btrfs, we use a set- 
intersection method to derive three sets consisting of new 
items, deleted items and modified items. Change records 
can be generated from these sets, using the identity of the 
containing item (e.g., directory inode) and some key as 
field name (such as the “name” for directory entries). 
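
For structures that contain sets of items, the comparison can be sketched with ordinary set operations. In this Python illustration, a directory is modeled as a dictionary from entry name to inode number, and None stands in for the ∅ sentinel; these representations and the function name diff_set are assumptions.

```python
from typing import Any, Dict, List, Tuple

EMPTY = None  # stands in for the ∅ sentinel


def diff_set(typ: str, container_id: Any,
             old_items: Dict[str, Any],
             new_items: Dict[str, Any]) -> List[Tuple[str, Any, str, Any, Any]]:
    """Derive change records for new, deleted and modified items of a container."""
    old_keys, new_keys = set(old_items), set(new_items)
    created = new_keys - old_keys
    deleted = old_keys - new_keys
    modified = {k for k in old_keys & new_keys if old_items[k] != new_items[k]}
    records = []
    for key in sorted(created):
        records.append((typ, container_id, key, EMPTY, new_items[key]))
    for key in sorted(deleted):
        records.append((typ, container_id, key, old_items[key], EMPTY))
    for key in sorted(modified):
        records.append((typ, container_id, key, old_items[key], new_items[key]))
    return records


old_dir = {"a": 40, "b": 41}
new_dir = {"b": 99, "c": 42}
dir_records = diff_set("Dir", 12, old_dir, new_dir)
```

The entry name serves as the field of each record, and the containing directory's inode number serves as the id, matching the [type, id, field, oldval, newval] format.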

The txn_check function implements the invariant checks 
described, with examples, in Section 3. 


4.4 Handling Invariant Violation 


The final problem for an online consistency checker like 
Recon is dealing with invariant violations. It is important 
to ensure that recovery from a violation is correct and so 
the safest strategy is to disable all further modifications 
to the file system to avoid corruption. The file system 
can then be unmounted and restarted manually or trans- 
parently to applications [26]. In this case, the file system 
is not corrupt but may have lost some data. If the ability 
to create a snapshot (e.g., a btrfs snapshot) is available, 
then a snapshot could be created immediately, the prob- 
lem reported, and then we could continue running the file 
system. It is important to isolate the snapshot from the 
buggy file system, e.g., by directing all further writes to a 
separate partition. In this case, data is preserved but the 
file system may be corrupt. Finally, it may be possible to 
repair file system data structures dynamically [8]. 


5 Evaluation 


In this section, we evaluate our Recon implementation for 
ext3 in terms of 1) its complexity, 2) its ability to detect metadata 
corruption at runtime, and 3) its performance impact. 
Currently, we are finishing our btrfs implementation, and 
we plan to evaluate it in the near future. 


5.1 Completeness and Complexity 


We have implemented all of the checks performed by the 
e2fsck file system checker, as encapsulated by the SQCK 
rules, for the mandatory file system features. Overall, 
we need only 31 invariants (vs 101 SQCK rules) because 
some properties are easier to verify at runtime. For exam- 
ple, a large number of fields in the superblock and block 
group descriptors are protected with the simple invariant 
that they should not be changed by a running file system. 
We also avoid explicit range check invariants in several 
cases because they are naturally embedded in other invariants 
that must check for setting or clearing of bits in 
bitmaps. There are a small number of properties on optional 
features that we do not check, such as OS-specific 
fields in inodes and extended-attribute ACLs. 


Our entire system consists of 3.8k lines of C code 
(kLOC), as measured by the cloc [7] tool. Of these, 1.5 
kLOC are in the generic framework which can be reused 
across file systems, 1.5 kLOC are for interpreting the ext3 
metadata, and only 0.8 kLOC are involved in checking the 
invariants. Our dependence on the journal checkpointing 
code adds another 311 lines. The code required to do the 
checking is simpler than the file system code for several 
reasons. First, within the thread checking a transaction, 
we do not need to worry about concurrency, as the buffers 
we are examining are under the control of the journal. In 
contrast, the file system needs to be servicing multiple 
client threads. Second, the implementation of each invari- 
ant check is independent of the other checks because each 
rule uses its own data structures to keep track of properties 
that must be verified. Finally, the implementation of each 
rule is usually quite simple, requiring several lines of C to 
accumulate the necessary data and a few more (often just 
a single boolean expression) to verify. 


5.2 Ability to Detect Corruption 


Evaluating resiliency against metadata corruption is 
tricky. To best represent real-world corruption scenarios, 
we would either inject subtle bugs in the file system or 
reproduce known bugs. However, subtle bugs (i.e., bugs 
not easily found in a heavily-used file system) are hard 
to design or reproduce. Reproducing known bugs is dif- 
ficult as they often depend on specific kernel versions, 
combinations of loadable modules, concurrency levels, or 
workloads. Instead, we settled for deliberately injecting 
corruption of bytes within metadata blocks. This mim- 
ics the corruption that could result from several types of 
bugs (e.g., setting values in arbitrary fields incorrectly) 
either within the file system or in the overall kernel. We injected 
both type-specific corruption, where we target specific 
metadata block types and fields, and fully random 
corruption, where we corrupt a sequence of 1 to 8 bytes 
within some number of blocks in a transaction. 
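
A fully random corruption of this kind could be generated as in the Python sketch below. The function name and the XOR-based choice of replacement bytes are assumptions; the text specifies only that a sequence of 1 to 8 bytes is corrupted.

```python
import random
from typing import Tuple


def corrupt_random(block: bytes, rng: random.Random) -> Tuple[bytes, int, int]:
    """Corrupt a run of 1 to 8 bytes; return (corrupted block, offset, length)."""
    n = rng.randint(1, 8)                            # inclusive on both ends
    start = rng.randrange(len(block) - n + 1)
    # XOR with a nonzero value so that every byte in the run really changes.
    run = bytes(b ^ rng.randrange(1, 256) for b in block[start:start + n])
    return block[:start] + run + block[start + n:], start, n


original = bytes(4096)                               # a zero-filled 4 KB block
corrupted, offset, length = corrupt_random(original, random.Random(0))
```

Seeding the generator makes a run reproducible, which helps when recording which corruption was performed and whether it was detected.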


Setup We compare Recon against e2fsck by corrupting 
metadata just before it is committed to the journal. We 
begin each corruption experiment by creating and pop- 
ulating a fresh file system, to ensure that there are no 
errors initially. Next, we start a process that creates a 
background of I/O operations (specifically we run a ker- 
nel compile and clean, repeatedly). The corruptor then 
sleeps for 20-90 seconds, wakes up, and performs the re- 
quested corruption (type-specific or random). We record 
Figure 3: Comparison of corruption detection accuracy 


the corruption performed and whether or not Recon de- 
tected it. Next, we allow the transaction to commit, and 
then immediately prevent any future writes. This step en- 
sures that the corruption is limited to the bytes that we 
selected, rather than the result of the file system acting 
further on corrupt data. Next, we unmount the file sys- 
tem, run e2fsck on it, and record whether it found and 
repaired any errors. Finally, we run e2fsck a second time 
to see if the file system is clean after the repairs, and then 
reboot the system for the next experiment. For these ex- 
periments, we use a 4 GB file mounted as a loop device 
for our file system. This simplified the restoration of the 
file system following each corruption experiment. 

Our corruption framework can only corrupt blocks that 
the file system is already modifying in some transaction. 
In particular, we never corrupt the superblock since the 
running file system never includes writes to it. We do 
not consider this to be a serious limitation to our test re- 
sults since nearly all superblock corruptions would be triv- 
ially detected by Recon. Specifically, Recon protects most 
fields in the superblock with the invariant that they should 
not be modified at all, which is very easy to check. 


Results Figure 3 summarizes the results of our corrup- 
tion experiments. We show a wide bar and two stacked 
bars for each type of metadata corruption and random cor- 
ruption. The wide bar shows the percent of corruptions 
(Y axis) that were caught by both e2fsck and Recon. The 
stacked bars show the percent of corruptions that were de- 
tected by only one checker. Numbers in the bars show the 
absolute number of corruptions detected. 

For inodes, we present 3 sets of bars, representing dif- 
ferent types of inode fields. The first group includes fields 
that are reported by “stat”, the second group consists of all 
the block pointer fields, and the third group consists of ev- 
erything else. Our coverage is nearly identical to e2fsck in 
all cases. Many of the inode stat fields are unrelated to file 
system consistency (e.g. the timestamps and userids) and 
are permitted to change arbitrarily, making it hard to detect 
corruption with either checker. However, both checkers 
are effective at catching corruption of block pointers. 
Recon achieves 100% in this case because it checks all 
inodes in a block being written to disk while e2fsck ig- 
nores unused inodes. Although file system consistency is 
not affected by changes to unused inodes, it is still useful 
to detect this corruption because it indicates a bug in the 
system. For the final set of inode fields, e2fsck detects 
an invalid flag setting that Recon does not check in two 
runs, while Recon catches corruption of some unused in- 
ode flags and a corruption of the dir_acl field that appears 
valid when checked by e2fsck after the fact in four runs. 

For directory entries (dir), both checkers detect the 
same corruptions, with neither checker detecting corrup- 
tion of the name field. For the other metadata types, 
Recon is more effective than e2fsck at detecting corrup- 
tion, largely because it is able to take other runtime be- 
havior into account. For example, Recon achieves 100% 
detection for block group descriptor (bgd) corruption be- 
cause most of these fields should not be changed by a run- 
ning file system. Once corruption has reached the disk,
however, it is not always possible to distinguish the cor-
rect values from corrupted, but still valid, values. Simi- 
larly, Recon detects 100% of the block and inode bitmap 
(bbm and ibm, respectively) corruptions while e2fsck has 
a lower detection rate because it does not check unused 
parts of metadata blocks. For example, e2fsck does not 
check bits in the inode bitmap for non-existent inodes, or 
bits in the block bitmap for uninitialized block group de- 
scriptor table blocks. Recon’s higher coverage on specific 
metadata fields leads to higher coverage for fully random 
corruption as well. We expect that adding the final set of 
ext3 invariants for OS-specific inode fields and extended 
attributes will help us detect all ext3 structural consistency 
violations. However, neither checker can achieve 100% 
accuracy because some of the corruptions hit fields unre- 
lated to structural consistency. 

After e2fsck performs repair, it still detects errors in 
28 out of 731 cases (3.8%) when it is run a second time
on the “repaired” file system. Two of these failures oc- 
curred after a single byte was corrupted in a single meta- 
data block. In our experiments, we unmount the file sys- 
tem and check it with e2fsck immediately after the cor- 
rupted transaction is committed to the journal. In reality, 
it is likely that the file system would continue operation 
with bad data for some time, making the chances of suc- 
cessful repair even lower. In these cases, Recon’s ability 
to prevent corruption from reaching the on-disk metadata 
is particularly valuable. 


5.3 Performance


Setup All performance tests were done on a 1 TB ext3-
formatted file system on a machine with 2GB total RAM 
and dual 3 GHz Xeon CPUs. We used the Linux port 
of FileBench (version 1.4.8.fsl.0.8) with the application 

Benchmark       Parameters                    Size
Webserver       nfiles=250k                   3.9 GB
Webproxy        nfiles=500k                   7.8 GB
Varmail         nfiles=250k                   3.9 GB
Fileserver      nfiles=500k, filesize=32k     15.6 GB
MS-Networkfs    based on [17]                 19.9 GB

Table 5: Benchmark Characteristics

emulation workload personalities*. We included the Net- 
workfs personality, which supports a more sophisticated 
file system model, with a custom profile configured to 
match the metadata characteristics from a recent study of 
Windows desktops [17]. For Fileserver, we reduced the
default file size to 32k to increase the metadata to data 
ratio in the file system. In all other cases, we used de- 
fault parameter settings. Table 5 summarizes the basic 
characteristics of our benchmarks.† The metadata load
varies widely across the benchmarks, spanning the range 
of Recon cache sizes, causing misses in the cache. In par- 
ticular, the Fileserver benchmark uses over 25k directo- 
ries. The metadata consumed by directory entry blocks 
alone is greater than 100MB. The inodes for the direc-
tories and files would consume approximately 70MB if 
they were stored compactly, but ext3 distributes alloca- 
tion across different block groups, so unused inodes add 
to the metadata overhead. While the Networkfs bench- 
mark involves more file data, the total number of files is 
lower because of the larger file size distribution. 
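
These estimates can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes 4 KiB directory blocks, 128-byte on-disk inodes, and at least one directory entry block per directory; none of these sizes is stated in the text, so they should be read as illustrative assumptions.

```python
# Assumed sizes (not stated in the text): 4 KiB directory blocks and
# 128-byte on-disk inodes.
BLOCK_SIZE = 4096
INODE_SIZE = 128

num_dirs = 25_000      # "over 25k directories", used as a lower bound
num_files = 500_000    # nfiles=500k for Fileserver (Table 5)

# Each directory needs at least one directory entry block.
dirent_bytes = num_dirs * BLOCK_SIZE
# One inode per file and per directory, packed with no unused slots.
inode_bytes = (num_dirs + num_files) * INODE_SIZE

print(f"directory entry blocks: {dirent_bytes / 1e6:.1f} MB")
print(f"compact inodes: {inode_bytes / 1e6:.1f} MB")
```

Under these assumptions the directory entry blocks come to just over 100MB and the compactly stored inodes to a little under 70MB, which is in line with the figures quoted above.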

The benchmarks are run for one hour for all workloads 
to ensure that we capture steady-state behavior with Re- 
con. We report the performance of Recon compared to 
native ext3 for both the initial benchmark setup, which 
involves heavy metadata writes (Table 6), and the actual 
workload execution (Figure 4). 

Our current transaction-layer commit implementation 
(described in Section 4) cannot evict blocks from our 
metadata cache that have not yet been checkpointed to the 
file system. Thus, the metadata cache size must be larger 
than the journal size. However, any memory consumed by 
Recon’s metadata cache reduces the memory available for 
the file system cache by the same amount because Linux 
implements a shared page cache. We present results for 
three different cache/journal sizes, for both native and Re- 
con performance. FileBench emulates workloads using a 
variety of random variables for file and operation selec- 
tion. Thus, there is natural performance variation across 
runs. Since this is representative of behavior “in the wild”,
we report the average of 5 runs with error bars. All tests 
are done with cold caches on a freshly booted system. 


*The OLTP personality did not work in the version we obtained.

†The full profile used in the experiments is available at
http://csng.cs.toronto.edu/publications/260/get?file=/publication_files/210/recon-fast2012-workloads.tgz
Cache=64MB, Journal=32MB
Setup (seconds) Ext3              Recon             Ratio
Webserver       2171.0±442.8      2903.2±45.7       133.7
Webproxy        229.4±26.0        323.0±24.3        -
Varmail         110.2±11.4        110.8±4.4         -
Fileserver      13728.5±694.2     17705.8±413.5     -
MS-Networkfs    2096.8±140.4      2113.8±119.2      -

Cache=128MB, Journal=64MB
Setup (seconds) Ext3              Recon             Ratio
Webserver       1722.0±77.4       1668.6±36.7       96.9
Webproxy        212.8±13.5        243.4±23.5        -
Varmail         118.6±12.3        113.8±16.2        -
Fileserver      11487.2±849.8     12906.8±1316.8    -
MS-Networkfs    1757.4±70.2       1893.0±73.0       -

Cache=256MB, Journal=128MB
Setup (seconds) Ext3              Recon             Ratio
Webserver       1405.6±24.4       1340.2±29.6       95.3
Webproxy        227.2±19.5        224.4±24.0        98.8
Varmail         109.4±9.5         123.0±5.0         112.4
Fileserver      9785.6±491.6      10374.8±928.8     106.0
MS-Networkfs    1651.8±113.8      1719.4±31.5       104.1

Table 6: Setup time for benchmarks (lower is better)


Results During the benchmark setup, when many files 
are being created, there is a significant cost to Recon, par- 
ticularly for small cache sizes. The dominating factor is 
I/O time for metadata cache misses because file creation 
quickly and repeatedly touches the entire working set of 
metadata. However, as the cache size increases, the im- 
pact is rapidly reduced. With a 128MB metadata cache, 
the added overhead of Recon is within the experimental 
error of ext3’s native performance. The impact of Recon 
is less noticeable during normal benchmark operations. 
With our smallest metadata cache size (64MB), there is a 
worst case overhead of only 15% for Fileserver, which is 
generally reduced as the cache size increases. The one ex- 
ception to this trend is the Networkfs personality (ms_nfs 
in Figure 4), where performance degrades with an increas- 
ing Recon cache size. We believe this is the result of 
memory pressure, as our increased metadata cache size 
decreases the amount of memory available to the file sys- 
tem buffer cache. Overall, a 128MB metadata cache with 
a 64MB journal gives the best results for all workloads, 
with only 8% degradation on average. In most cases, file 
system throughput with Recon is within the margin of er- 
ror of ext3 performance. Given the growth in main mem- 
ory sizes, these are quite modest memory requirements for 
the reliability benefits that Recon can deliver. 
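
The Ratio column in Table 6 appears to express Recon's setup time as a percentage of native ext3's setup time. That reading is an assumption, although it reproduces the reported values for the first row of the table, as the short sketch below shows (the function name is chosen for illustration).

```python
def overhead_ratio(ext3_seconds, recon_seconds):
    """Recon setup time as a percentage of native ext3 setup time."""
    return round(100.0 * recon_seconds / ext3_seconds, 1)


# Mean setup times (seconds) from the first row of Table 6.
small_cache = overhead_ratio(2171.0, 2903.2)   # Cache=64MB, Journal=32MB
large_cache = overhead_ratio(1405.6, 1340.2)   # Cache=256MB, Journal=128MB
print(small_cache, large_cache)
```

A value above 100 means setup was slower with Recon, and a value below 100 means the measured mean was lower with Recon, which at these error margins is best read as no measurable overhead.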


6 Related Work 


We discuss several areas of research that are closely re- 
lated to this work, including methods for 1) handling file
system bugs, 2) checking file system consistency, and 3) interpreting
file system semantics and verification.


6.1 Handling File System Bugs 


File system bugs can be detected statically or at runtime. 
Bug finding tools, based on model checking [29, 31] and 
static analysis [21], have revealed scores of bugs in a vari- 
ety of file systems. However, these tools cannot be relied 
upon to identify all bugs because they need to perform 
exhaustive evaluation. Furthermore, even when a bug is 
known, a bug fix may not be easily available, or easy to 
deploy in live systems [1]. These limitations can be ad- 
dressed by tolerating bugs at runtime. 

Figure 4: Performance on FileBench workloads for varying
metadata cache sizes

EnvyFS [3] applies N-version programming for detecting
file system bugs. It uses the common VFS interface
to pass each file system request received by the VFS layer
to three child file systems. The results are then compared
and the majority result is returned. EnvyFS avoids storing
3 data copies by using a customized single-instance store. 
Although EnvyFS is able to detect and in some cases re- 
pair errors introduced in child file systems, the run time 
overheads are significant because the operations must be 
issued to at least two file systems and the results compared 
before an answer is returned. Also, subtle differences in 
file system semantics can make it hard to compare results. 
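
The compare-and-vote step of N-version programming can be sketched as follows. This is only an illustration of majority voting over the results returned by several child file systems; it is not EnvyFS code, and the function name and sample values are hypothetical.

```python
from collections import Counter


def majority_result(results):
    """Return the value reported by a strict majority, else None."""
    value, votes = Counter(results).most_common(1)[0]
    return value if votes > len(results) // 2 else None


# Three hypothetical child file systems answer the same read request.
agreed = majority_result([b"data", b"data", b"dat\x00"])
split = majority_result([b"a", b"b", b"c"])
print(agreed, split)
```

The sketch also hints at the cost noted above: no answer can be returned until at least two children have responded, and the comparison assumes that results are directly comparable across file systems.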


Membrane [26] proposes tolerating bugs by transpar- 
ently restarting a failed file system. It assumes that file 
system bugs will lead to detectable, fail-stop crash fail- 
ures. However, inconsistencies may have propagated to 
the on-disk metadata by the time the crash occurs. Our approach
is complementary to Membrane: rather than waiting for the
file system to crash, a restart could be initiated
when Recon detects an inconsistent transaction.


6.2 Checking File System Consistency 


SQCK [11] expresses the many complex checks per- 
formed by e2fsck as a set of compact SQL queries. It 
improves upon the repairs done by e2fsck by correcting 
the order in which repairs were performed and by using 
redundant file-system metadata ignored by e2fsck. 
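
To give a flavor of a consistency check written as a declarative query, the sketch below loads a toy table of block references into SQLite and finds blocks claimed by more than one inode. The schema and the query are hypothetical and are not taken from SQCK.

```python
import sqlite3

# Hypothetical, simplified schema: one row per (inode, block) reference.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE block_refs (inode INTEGER, block INTEGER)")
conn.executemany("INSERT INTO block_refs VALUES (?, ?)",
                 [(12, 100), (12, 101), (13, 200), (14, 101)])

# Declarative check: a data block must not be claimed by more than one inode.
duplicates = conn.execute(
    "SELECT block, COUNT(DISTINCT inode) FROM block_refs "
    "GROUP BY block HAVING COUNT(DISTINCT inode) > 1 ORDER BY block"
).fetchall()
print(duplicates)
conn.close()
```

Each returned row names a block and the number of inodes that reference it, so an empty result means the check passes.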

Chunkfs proposes reducing the consistency check time 
by breaking the file system into chunks that can be 
checked independently [13]. While this idea is appealing, 
unfortunately the chunks are not independent and thus 
cannot be checked truly independently. Specifically, path- 
names can span chunks, and Chunkfs uses cross-chunk 
references to handle hard links and files that are larger 
than chunks or need allocation across chunks. 

ZFS provides the ability to scrub disks and repair cor- 
rupt blocks that have redundant copies [4]. Scrubbing can 
detect latent hardware errors but does not necessarily de- 
tect software bugs, e.g., if the block has a consistency er- 
ror but passes the checksum. NetApp filers can run some 
phases of the wafliron check program on an online system, 
but this process is resource intensive and time consuming.


6.3 File System Semantics and Verification


Semantically-smart disks use probing to gather detailed 
knowledge of file system behavior [24]. This knowledge 
is used at the block interface to transparently improve per- 
formance or enhance functionality, such as by implement- 
ing track-aligned extents and secure delete. This work 
builds on several ideas from semantically-smart disks. 

The XN storage system of the Xok exokernel is de- 
signed to protect library file systems that manage their 
own disk blocks [15]. XN uses a file-system specific 
function called own(), similar to the Recon references() 
function, that returns the blocks controlled by a meta-data 
block. This function allows XN to verify that a file system 
can only access blocks that are allocated to it. XN can also 
use a file-system specific function called reboot() that tra- 
verses the entire file-system tree and detects whether the 
file system is crash consistent. This work shows that file- 
system consistency can be verified at runtime efficiently. 
File systems must use an extended block interface (e.g.,
allocate, read, write, deallocate) and provide block type
information to XN, which allows easier verification,
while Recon only requires the basic block interface (e.g., 
read, write) and infers file system information. Also, XN 
protects file systems from each other and may allow a file 
system to corrupt itself, while our focus is on protecting 
the file system from itself. Similar to XN, a type-safe 
disk extends the disk interface by exposing primitives for 
block allocation [23], which helps enforce invariants such 
as preventing accesses to unallocated blocks. 


FAST ’12: 10th USENIX Conference on File and Storage Technologies


There has been significant work on discovering pro- 
gram invariants by capturing variable values at key points 
in a program to repair data structures [8] and to patch 
buggy deployed software [18]. We plan to apply these 
methods to learn file-system invariants and repair updates 
that cause invariant violations. Our work is influenced by 
runtime verification, a technique that applies formal anal- 
ysis to the running system rather than its model [25, 5]. 


Our system can be viewed as a firewall with a set of 
rules that help protect disks from accesses that could com- 
promise file-system integrity. Defining and implementing 
these rules in a high-level language, such as the Linux ipt- 
ables rules [22], is an avenue for future work. 


7 Conclusions and Future Work 


The Recon system protects file system metadata from
buggy file system operations. It relies on two key ideas:
commit points, and consistency invariants that are verified
at those points. Modern
file systems aim to ensure file system consistency at com- 
mit points. Consistency invariants are declarative state- 
ments that must be satisfied at these points before data is 
committed or else the file system may get corrupted. We 
reuse the consistency rules used by a file system checker 
to derive the invariants. As a result, Recon detects ran- 
dom corruption at runtime as effectively as the file system 
checker. It has low overhead because the data it interprets 
has likely been recently accessed by the file system. 
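
As a sketch of what checking a declarative invariant at a commit point might look like, the example below tests one illustrative rule: every block that becomes referenced in a transaction must also be marked allocated in that transaction. The rule, the function names, and the data representation are assumptions made for the example, not Recon's implementation.

```python
def check_commit(new_pointers, bits_set):
    """Return blocks that became referenced without being marked allocated.

    new_pointers: block numbers newly referenced by inodes in this transaction.
    bits_set: block numbers whose block-bitmap bit changed from 0 to 1.
    """
    return sorted(set(new_pointers) - set(bits_set))


def commit(new_pointers, bits_set):
    """Allow the commit only when the invariant holds."""
    violations = check_commit(new_pointers, bits_set)
    if violations:
        raise RuntimeError(f"invariant violated for blocks {violations}")
    return "committed"


ok = commit(new_pointers=[500, 501], bits_set=[500, 501])
bad = check_commit(new_pointers=[500, 501], bits_set=[500])
print(ok, bad)
```

The check only examines the blocks touched by the transaction, which is consistent with the observation that the data it interprets has likely been accessed recently.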


A system that checks the file system is easier to imple- 
ment correctly than the file system itself. When check- 
ing a transaction, we do not need to worry about concur- 
rency because the buffers we are examining are under our 
control. In contrast, the file system needs to be servicing 
multiple client threads. Also, each invariant is indepen- 
dent because it uses its own data structures to keep track 
of the properties that must be checked, and we find that 
the implementation of each rule is usually quite simple. The
bulk of the complexity lies in interpreting metadata struc- 
tures. We plan to develop a systematic way to describe 
and interpret these structures. 


While an offline checker can only make decisions based 
on the current file system state, Recon can also observe the 
file system operations in progress. We plan to investigate 
whether this allows detecting certain operational bugs un- 
related to file system consistency, e.g., updates to userid 
or timestamp fields. 
Abstract


Virtualization allows computing resources to be utilized 
much more efficiently than those in traditional systems, 
and it is a strong driving force behind commoditizing 
computing infrastructure for providing cloud services. 
Unfortunately, the multiple layers of abstraction that vir- 
tualization introduces also complicate the proper under- 
standing, accurate measurement, and effective manage- 
ment of such an environment. In this paper, we focus 
on one particular layer: storage virtualization, which en- 
ables a host system to map a guest VM’s file system to 
almost any storage media. A flat file in the host file sys- 
tem is commonly used for this purpose. However, as we 
will show, when one file system (guest) runs on top of 
another file system (host), their nested interactions can 
have unexpected and significant performance implica- 
tions (as much as 67% degradation). From performing 
experiments on 42 different combinations of guest and 
host file systems, we give advice on how to and how not 
to nest file systems. 


1 Introduction 


Virtualization has significantly improved hardware utilization,
thus allowing IT service providers to offer a
wide range of application, platform, and infrastructure solutions
through low-cost, commoditized hardware (e.g.,
Cloud [1, 5, 11]). However, virtualization is a double-edged
sword. Along with the many benefits it brings, virtualized
systems are also more complex and thus more
difficult to understand, measure, and manage. This is
often caused by layers of abstraction that virtualization 
introduces. One particular type of abstraction, which 
we use often in our virtualized environment but have not 
yet fully understood, is the nesting of file systems in the 
guest and host systems. 

In a typical virtualized environment, a host maps regular
files as virtual block devices to virtual machines
(VMs). Completely unaware of this, a VM would format
the block device with a file system that it thinks is
the most suitable for its particular workload. Now, we
have two file systems, a host file system and a guest
file system, both of which are completely unaware of
the existence of the other layer. Figure 1 illustrates such
a scenario. The fact that there is one file system below
another complicates an already delicate situation,
in which file systems make certain assumptions and
base their optimizations on them. When some of these
assumptions are no longer true, these optimizations will no
longer improve performance, and sometimes will even
hurt performance. For example, in the guest file system,
optimizations such as placing frequently used files
on outer disk cylinders for higher I/O throughput (e.g.,
NTFS), de-fragmenting files (e.g., QCoW [7]), and ensuring
meta-data and data locality can cause some unexpected
effects when the real block allocation and placement
decisions are done at a lower level (i.e., in the host).

[Figure: VMs (VM1, VM2, ...) each run a guest file system on /dev/sda, backed by image files (/images/VM1disk, /images/VM2disk, /images/VM3disk) in the host file system, above the hypervisor and hardware.]

Figure 1: Scenario of nesting of file systems.


An alternative to using files as virtual block devices 
is to give VMs direct access to physical disks or logi- 
cal volumes. However, there are several benefits in map- 
ping virtual block devices as files in host systems. First, 
using files allows storage space overcommit when they 
are thinly provisioned. Second, snapshotting a VM im- 
age using copy-on-write (e.g., using QCoW) is simpler 
at the file level than at the block level. Third, manag- 
ing and maintaining VM images and snapshots as files is 
also easier and more intuitive as we can leverage many
existing file-based storage management tools. Moreover, 
the use of nested virtualization [6, 15], where VMs can 
act as hypervisors to create their own VMs, has recently 
been demonstrated to be practical in multiple types of hy- 
pervisors. As this technique encourages more layers of 
file systems stacking on top of one another, it would be 
even more important to better understand the interactions 
across layers and their performance implications. 

In most cases, a file system is chosen over other 
file systems primarily based on the expected workload. 
However, we believe that, in a virtualized environment, the
guest file system should be chosen based not only on
the workload but also on the underlying host file system.
To validate this, we conduct an extensive set of experi- 
ments using various combinations of guest and host file 
systems including Ext2, Ext3, Ext4, ReiserFS, XFS, and 
JFS. It is well understood that file systems have different 
performance characteristics under different workloads. 
Therefore, instead of comparing different file systems, 
we compare the same guest file system among different 
host file systems, and vice versa. From our experiments, 
we observe significant I/O performance differences. An 
improper combination of guest and host file systems can 
be disastrous to performance; but with an appropriate 
combination, the overhead can be negligible. 

The main contributions of this paper are summarized 
as follows. 


• A quantitative study of the interactions between
guest and host file systems. We demonstrate that the
virtualization abstraction at the file system level can
be more detrimental to I/O performance than
is generally believed.


• A detailed block-level analysis of different combi-
nations of guest/host file systems. We uncover the 
reasons behind I/O performance variations in dif- 
ferent file system combinations and suggest various 
tuning techniques to enable more efficient interac- 
tions between guest and host file systems to achieve 
better I/O performance. 


From our experiments, we have made the follow- 
ing interesting observations: (1) for write-dominated 
workloads, journaling in the host file system could 
cause significant performance degradations, (2) for read- 
dominated workloads, nested file systems could even im- 
prove performance, and (3) nested file systems are not 
suitable for workloads that are sensitive to I/O latency. 
We believe that more work is needed to study perfor- 
mance implications of file systems in virtualized envi- 
ronments. Our work takes a first step in this direction, 
and we hope that these findings can help file system de- 
signers to build more adaptive file systems for virtualized 
environments. 
The remainder of the paper is structured as follows.
Section 2 surveys related work. Section 3 presents
macro-benchmarks to understand the performance im- 
plications of nesting file systems under different types 
of workloads. Section 4 uses micro-benchmarks to dis- 
sect the interactions between guest and host file systems 
and their performance implications. Section 5 discusses 
significant consequences of nested file systems with pro- 
posed techniques to improve I/O performance. Finally, 
Section 6 concludes the paper. 


2 Related Work 


Virtualizing I/O, especially storage, has been proven to 
be much more difficult than virtualizing CPU and mem- 
ory. Achieving bare-metal performance from virtual- 
ized storage devices has been the goal of many past 
works. One approach is to use para-virtualized I/O device
drivers [26], in which a guest OS is aware that it is
running inside a virtualized environment and thus
uses a special device driver that explicitly cooperates
with the hypervisor to improve I/O performance. Ex-
amples include KVM’s VirtIO driver [26], Xen’s para- 
virtualized driver [13], and VMware’s guest tools [9]. 
Additionally, Jujjuri et al. [22] proposed to move the 
para-virtualization interface up the stack to the file sys- 
tem level. 

The use of para-virtualized I/O device drivers is almost
a de facto standard for achieving reasonable I/O performance.
However, Yassour et al. [32] explored an alternative
solution that gives guests direct access to physical
devices to achieve near-native hardware performance. In
this paper, we instead focus on the scenario where vir- 
tual disks are mapped to files rather than physical disks 
or volumes. As we will show, when configured correctly, 
the additional layers of abstraction introduce only limited 
overhead. On the other hand, having these abstractions 
can greatly ease the management of VM images. 

Similar to nesting of file systems, I/O schedulers are 
also often used in a nested fashion, which can result 
in suboptimal I/O scheduling decisions. Boutcher and 
Chandra [17] explored different combinations of I/O 
schedulers in guest and host systems. They demon- 
strated that the worst case combination provides only 
40% throughput of the best case. In our experiments, we 
use the best combination of I/O schedulers found in their 
paper but try different file system combinations, with the 
focus on performance variations caused only by file sys- 
tem artifacts. Although, for performance purposes, there
is no benefit to performing additional I/O scheduling in 
the host, it has a significant impact on inter-application 
I/O isolation and fairness as shown in [23]. Many other 
works [18, 19, 25, 27] have also studied the impact of 
nested I/O schedulers on performance, fairness, and isolation,
and these are orthogonal to our work in the file
system space.

Figure 2: Setup for macro-level experimentation

When a virtual disk is mapped to an image file, the 
data layout of the image file can significantly affect its 
performance. QCOW2 [7], VirtualBox VDI [8], and 
VMware VMDK [10] are some popular image formats. 
However, as Tang [31] pointed out, these formats unnec- 
essarily mix the function of storage space allocation with 
the function of tracking dirty blocks. Tang presented 
an FVD image format to address this issue and demon- 
strated significant performance improvements for certain 
workloads. Various techniques [16, 20, 30] to dynam- 
ically change the data layout of image files, depending 
on the usage patterns, have also been proposed. Suzuki 
et al. [30] demonstrated that by co-locating data blocks
used at boot time, a virtual machine can boot much faster.
Bhadkamkar et al. [16] and Huang et al. [20] exploited 
data replication techniques to decrease the distance be- 
tween temporally related data blocks to improve I/O per- 
formance. Sivathanu et al. [29] studied the performance 
effect of placing the image file at different locations on a
disk.

I/O performance in storage virtualization can be im- 
pacted by many factors, such as device driver, I/O sched- 
uler, and image format. To the best of our knowledge, 
this is the first work that studies the impact of the choice 
of file systems in guest and host systems in a virtualiza- 
tion environment. 


3 Macro-benchmark Results


To better understand the performance implications 
caused by guest / host file system interactions, we take 
a systematic approach in our experimental evaluation. 
First, we exercise macro-benchmarks to understand the 
potential performance impact of nested file systems on 
realistic workloads, from which we were able to observe
a significant performance impact. In Section 4, we
use micro-benchmarks coupled with low-level I/O trac- 
ing mechanisms to investigate the underlying cause. 


3.1 Experimental Setup 


As there is no single “most common” or “best” file system
to use in the hypervisor or guest VMs, we conduct
our experiments using all possible combinations of popular
file systems on Linux (i.e., Ext2, Ext3, Ext4, ReiserFS,
XFS, and JFS) in both the hypervisor and guest
VMs, as shown in Figure 2. A single x86 64-bit machine
is used to run KVM [24] at the hypervisor level, and
QEMU [14] is used to run guest VMs¹. To reflect a typical
enterprise setting, each guest VM is allocated a single
dedicated processor core. More hardware and software
configuration settings are listed in Table 1.

         Hardware                       Software
Host     Pentium D 3.4GHz, 2GB RAM      Ubuntu 10.04 (2.6.32-33)
         80GB WD 7200 RPM SATA (sda)    qemu-kvm 0.12.3
         1TB WD 7200 RPM SATA (sdb)     libvirt 0.9.0
Guest    Qemu 0.9, 512MB RAM            Ubuntu 10.04 (2.6.32-33)

Table 1: Testbed setup

The entire host OS is installed on a single disk (sda) 
while another single disk (sdb) is used for experiments. 
We create multiple equal-sized partitions from sdb, each 
corresponding to a different host file system. Each parti- 
tion is then formatted using the default parameters of the 
host file system’s mkfs* command and is mounted using
the default parameters of mount. In the newly created
host file system, we create a flat file and expose this flat
file as the logical block device to the guest VM, which in
turn further partitions the block device, with each partition
corresponding to a different guest file system. By default,
virtio [26] is used as the block device driver for the guest
VM and we use write-through as the caching mode
for all backend storage. The end result is the guest VM
having access to all combinations of guest and host file 
systems. Table 2 shows an example of our setup: a file 
created on /dev/sdb3, which is formatted as Ext3, is 
exposed as a logical block device vdc to the guest VM, 
which further partitions vdc into vdc2, vdc3, vdc4, etc. 
for different guest file systems. Note that all disk parti- 
tions of the hypervisor (sdb*) and the guest (vdc*) are 
properly aligned using fdisk to avoid most of the block 
layer interference caused by misalignment problems. 

In addition to the six host file systems, we also create 
a raw disk partition that is directly exposed to the guest 
VM and is labeled as Block Device (BD) in Table 2. This 
allows a guest file system to sit directly on top of a physi- 
cal disk partition without the extra host file system layer. 
This special case is used as our baseline to demonstrate 
how large (or how small) of an overhead the host file sys- 
tem layer induces. However, this particular setup has a side
effect: file systems created on outer disk cylinders will have
higher I/O throughput than those created on inner cylinders.
Fortunately, as each disk partition created at the hypervisor
level is 60GB, only a portion of the entire disk is utilized,
which limits this effect. Table 2 also shows the results
of running hdparm on each disk partition. The largest
throughput difference between any two partitions is only
about 5%, which is fairly negligible.

¹Similar performance variations are observed in the experiments
with other hypervisors including Xen and VMware, which are shown
in the Appendix.

Table 2: Physical and logical disk partitions

Table 3: Parameters for Filebench workloads

The choice of I/O scheduler at host and guest levels 
can significantly impact performance [17, 21, 27, 28]. As
file systems are the primary focus of this paper, we used
the CFQ scheduler in the host and the Deadline scheduler in
the guest as these schedulers were shown to be the top
performers in their respective domains by Boutcher and 
Chandra [17]. 


3.2 Benchmarks 


We use Filebench [3] to generate macro-benchmarks 
of different I/O transaction characteristics controlled by 
predefined parameters, such as the number of files to 
be used, average file size, and I/O buffer size. Since 
Filebench supports synchronization between threads
to simulate concurrent and sequential I/Os, we use this 
tool to create four server workloads: a file server, a web 
server, a mail server, and a database server. The specific 
parameters of each workload are listed in Table 3, show- 
ing that the experimental working set size is configured 
to be much larger than the size of the page cache in the 
VM. The detailed description of these workloads is as 
follows. 


• File server: Emulates an NFS file service. File operations
are a mixture of create, delete, append,
read, write, and attribute on files of various
sizes.

• Web server: Emulates a web service. File operations
are dominated by reads: open, read, and
close. Writing to the web log file is emulated by
having one append operation per open.

• Mail server: Emulates an e-mail service. File
operations are within a single directory and consist
of I/O sequences such as open/read/close,
open/append/close, and delete.

• Database server: Emulates the I/O characteristics
of Oracle 9i. File operations are mostly read and
write on small files. To simulate database logging,
a stream of synchronous writes is used.


3.3 Macro-benchmark Results


Our main objective is to understand how much of a per- 
formance impact nested file systems have on different 
types of workloads, and whether or not the impact can 
be lessened or avoided. As mentioned before, we use 
all combinations of six popular file systems in both the 
hypervisor and guest VMs. For comparison purposes, we
also include one additional combination, in which the hy- 
pervisor exposes a physical partition to guest VMs as a 
virtual block device. This results in 42 (6 x 7) different 
combinations of storage / file system configurations. 
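
The 42 configurations can be enumerated mechanically. The following Python sketch assumes the six file systems are those named on the figure axes of this paper (Ext2, Ext3, Ext4, ReiserFS, XFS, and JFS) and uses BD for the raw block device exposed by the hypervisor; the variable names are illustrative only.

```python
from itertools import product

# Six file systems, used in both the guest and the hypervisor.
FILE_SYSTEMS = ["Ext2", "Ext3", "Ext4", "ReiserFS", "XFS", "JFS"]

# Host-side storage options: the raw block device (BD) plus the six file systems.
HOST_OPTIONS = ["BD"] + FILE_SYSTEMS

# Each configuration is written guest/host, e.g. "Ext3/JFS".
combinations = [f"{guest}/{host}" for guest, host in product(FILE_SYSTEMS, HOST_OPTIONS)]

print(len(combinations))  # 6 guests x 7 host options
```
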
The performance results are shown in Figures 3 and 6, 
in terms of I/O throughput and I/O latency, respectively. 
Each sub-figure consists of a left and a right side. The 
left side shows the performance results when the guest 
file systems are provisioned directly on top of raw disk 
partitions in the hypervisor. These are expressed in absolute
numbers (i.e., MB per second for throughput or milliseconds
for latency) and are used as our baseline. The
right side shows the relative performance (to the baseline
numbers) of the guest file systems when they are provisioned
as files in the host file system. In these figures,
each column group represents a different storage option
in the hypervisor, and each column within the group
represents a different storage option in the guest VM.

Figure 3: I/O throughput for Filebench workloads (higher is better)


3.3.1 Throughput 


The baseline numbers (leftmost column group) show the 
intrinsic characteristics of various file systems under dif- 
ferent types of workloads. These characteristics indicate 
that some file systems are more efficient on large files 
than small files, while some file systems are more ef- 
ficient at reading than writing. As an example, when 
ReiserFS runs on top of BD, its throughput under the 
web server workload (27.2 MB/s) is much higher than 
that under the mail server workload (1.4 MB/s). These
properties of file systems are well understood, and how 
one would choose which file system to use is a straight- 
forward function of the expected I/O workload. How- 
ever, in a virtualized environment where nested file sys- 
tems are often used, the decision becomes more difficult. 
Based on the experimental results, we make the follow- 
ing observations: 

(1) A guest file system’s performance varies significantly
under different host file systems. Figure 3(B)
shows an example of the database workload. When ReiserFS
runs on top of Ext2, its throughput is reduced by
67% compared to its baseline number. However, when it
runs on top of JFS, its I/O performance is not impacted at
all. We use the coefficient of variance to quantify how differently
a guest file system’s performance is affected by different
host file systems, which is shown in Figure 4. For
each workload, a variance number is calculated based on
the relative performance values of a guest file system when
it runs on top of different host file systems. Our results
show that the throughput of ReiserFS experiences a large
variation (45%) under the database workload, while that
of Ext4 varies insignificantly (4%) under the web server
workload. The large variance numbers indicate that having
the right guest/host file system combination is critical
to performance, and having a wrong combination can result
in serious performance degradation. For instance,
under the database workload, ReiserFS/JFS is a right
combination, but ReiserFS/Ext2 is a wrong combination.

Figure 4: Coefficient of variance of guest file systems’
throughput under Filebench workloads across different
host file systems.

Figure 5: Total I/O transaction size of Filebench workloads

Figure 6: I/O latency of guest file systems under different workloads (lower is better)
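
As an illustration of how such a variance number could be computed, the sketch below takes the relative throughput of one guest file system on several host file systems and returns the coefficient of variance as the standard deviation divided by the mean. The relative values are made-up placeholders, not measurements from this study, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev

def coefficient_of_variance(relative_perf):
    """Standard deviation of the values divided by their mean."""
    return pstdev(relative_perf) / mean(relative_perf)

# Hypothetical relative throughput (nested / baseline) of one guest file
# system on four different host file systems. Placeholder numbers only.
relative_throughput = [0.5, 1.0, 1.0, 1.5]

cv = coefficient_of_variance(relative_throughput)
print(f"{cv:.1%}")
```

A guest file system whose relative throughput is the same on every host file system yields a value of zero; the more its throughput depends on the host file system, the larger the value.
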


(2) A host file system impacts different guest file 
systems’ performance differently. Similar to the pre- 
vious observation, a host file system can have a different 
impact on different guest file systems’ performance. Fig- 
ure 3(A) shows an example of the file server workload. 
When Ext2 runs on top of Ext3, its throughput is slightly 
degraded by about 10%. However, when Ext3 runs on 
top of Ext3, the throughput is reduced by 40%. The
coefficients of variance of guest file systems’
throughput shown in Figure 4 confirm this bidirectional
dependency between guest and host file systems,
which again stresses the importance of choosing the right
guest/host file system combination.


(3) A right guest file system/host file system combination
can produce minimal performance degradation.
The results shown in Figure 4 also reveal
how badly performance can be impacted
when a wrong combination of guest/host file system is
chosen. However, it is possible to find a guest file system
whose performance loss is the lowest. For example,
the results of the mail server workload show that when
Ext2 runs on top of Ext2, its throughput degradation is
the lowest (46%).

(4) The performance of nested file systems is af- 
fected much more by write than read operations. As 
one can see in Figure 3, all the combinations of nested 
file systems perform poorly for the mail server workload, 
unlike the other three workloads. We study the detailed 
disk traces from these workloads by examining request 
queuing time, request merging, request size, etc., and 
find that the mail server workload is only significantly 
different from the others in having a much higher pro- 
portion of writes than reads, as shown in Figure 5. We 
will use micro-benchmarks in Section 4 to describe the 
reasons behind this behavior. 


3.3.2 Latency 


The latency results are illustrated in Figure 6. Similar
to I/O throughput, latency also deteriorates when
guest file systems are provisioned on top of host file systems
rather than raw partitions. Whereas the impact on
throughput can be minimized (for some workloads) by
choosing the right combination of guest/host file systems,
latency is much more sensitive to the nesting of file
systems. In comparison to the baseline, the latency of
each guest file system varies within a certain range when it
runs on top of different host file systems. Even in the
lowest cases, latency is increased by 5-15% across the
board (e.g., Ext2 guest file system under the web server
workload). The coefficient of variance for latency is similar
to that of throughput shown in Figure 4. However, for
latency-sensitive workloads, like the database workload,
such a significant increase in I/O response time could be
unacceptable.

Description        Parameters
Total I/O size
I/O parallelism    255
I/O pattern        Random/Sequential
I/O mode           Native asynchronous I/O

Table 4: FIO benchmark parameters


4 Micro-benchmark Results

We first study nested file systems using a micro-level 
benchmark FIO [4]. Based on the experimental results,
we further conduct an analysis at the block layer on the 
guest VM and the hypervisor, respectively, using an I/O 
tracing mechanism [2]. 


4.1 Benchmark 


We use FIO as a micro-level benchmark to examine disk 
I/O workloads. As a highly configurable benchmark, 
FIO defines a test case based on different I/O transaction 
characteristics, such as total I/O size, block size, degree
of I/O parallelism, and I/O mode. Here our focus
is on the performance variation of primitive I/O operations,
such as read and write. With the combination of
these I/O operations and two I/O patterns, random and sequential,
we design four test cases: random read, random
write, sequential read, and sequential write. The specific 
I/O characteristics of these test cases are listed in Table 4. 
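
The four test cases are the cross product of two operations and two patterns. The sketch below generates one fio job section per test case. The rw values (read, write, randread, randwrite) are fio's names for these modes; mapping "native asynchronous I/O" to ioengine=libaio, the I/O parallelism of 255 to iodepth, and the 5 GB footprint to size are assumptions about how Table 4 would be expressed, and the block size is left out because it is not given here.

```python
from itertools import product

# fio's rw= value for each (pattern, operation) pair.
RW_MODE = {
    ("random", "read"): "randread",
    ("random", "write"): "randwrite",
    ("sequential", "read"): "read",
    ("sequential", "write"): "write",
}

def fio_job(pattern, operation):
    """Render one fio job section; option values are assumptions (see text)."""
    return "\n".join([
        f"[{pattern}-{operation}]",
        f"rw={RW_MODE[(pattern, operation)]}",
        "ioengine=libaio",  # Linux native asynchronous I/O
        "iodepth=255",      # assumed mapping of "I/O parallelism 255"
        "size=5g",          # assumed mapping of the 5 GB data footprint
    ])

jobs = [fio_job(p, o) for p, o in product(["random", "sequential"], ["read", "write"])]
print("\n\n".join(jobs))
```
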


4.2 Experimental Results 


On the same testbed, the experiments are conducted with
many small files, which create a total data footprint of 5 GB
for each workload. Figures 7 and 8 show the performance
in both sequential and random I/Os. Based on
the experimental results, we make two observations: 


• The performance of those workloads that are
dominated by read operations is largely unaf- 
fected by nested file systems. The performance 
impact is weakly dependent on guest/host file sys- 
tems. More interestingly, for sequential reads, in a 
few scenarios, a nested file system can even improve 
I/O performance (e.g., by 34% for Ext3/JFS). 


• The performance of those workloads that are
dominated by write operations is heavily affected 
by nested file systems. The performance impact 
varies in both random and sequential writes, with 
higher variations in sequential writes. In particu- 
lar, a host file system like XFS can degrade the per- 
formance by 40% for both random and sequential 
writes. As a result, it is important to understand the 
root cause of this performance impact, especially on 
the sequential write dominated workload. 


To interpret these observations, our analysis will focus 
on sequential workloads and the performance implica- 
tion across certain guest/host file system combinations. 
For this set of micro-benchmark experiments, due
to space constraints, we only concentrate on decipher- 
ing the I/O behavior of these representative file system 
combinations. Although only a few combinations are 
considered, principles used here are applicable to other 
combinations as well. 

For sequential read workloads, we attempt to uncover 
the reasons behind the significant performance improve- 
ment on the right guest/host file system combinations. 
We select the combinations of Ext3/JFS and Ext3/BD 
for analysis. For sequential write workloads, we try to 
understand the root cause of the significant performance 
variations in the scenarios of (1) different guest file sys- 
tems running on the same host file system and (2) the 
same guest file system operating on different host file 
systems. We analyze three guest file system/host file 
system combinations: Ext3/ReiserFS, JFS/ReiserFS, 
and JFS/XFS. Here Ext3/ReiserFS and JFS/ReiserFS are 
used to examine how different guest file systems can af- 
fect performance differently on the same host file system, 
while JFS/ReiserFS and JFS/XFS are used to examine 
how different host file systems can affect performance 
differently on the same guest file system. 


4.3 I/O Analysis


To understand the underlying cause of the performance 
impact due to nesting of file systems, we use blktrace 
to record I/O activities at both the guest and hypervisor 
levels. The resulting trace files are stored on another device,
which increases CPU utilization by only 3-4%. Therefore,
the interference with our benchmarks from such
I/O recording is negligible. Blktrace keeps a detailed account
of each I/O request from start to finish as it goes
through various I/O states (e.g., put the request onto an 
I/O queue, merge with an existing request, and wait on 
the I/O queue). The I/O states that are of interest to us in 
this study are described as follows. 


• Q: a new I/O request is queued by an application.

• I: the I/O request is inserted into an I/O scheduler
queue.

• D: the I/O request is being served by the device.

• C: the I/O request has been completed by the device.

Figure 7: I/O throughput of guest file systems in reading files. (A) random and (B) sequential

Figure 8: I/O throughput of guest file systems in writing files. (A) random and (B) sequential
(Legend: guest file systems Ext2, Ext3, ReiserFS, XFS, JFS; x-axis: host file systems BD, Ext2, Ext3, Ext4, ReiserFS, XFS, JFS)


Blktrace records the timestamp when an I/O request
enters a new state, so it is trivial to calculate the amount
of time the request spends in each state (i.e., Q2I, I2D,
and D2C). Here Q2I is the time it takes to insert/merge
a request onto a request queue. I2D is the time the request
idles on the request queue waiting for merging opportunities.
D2C is the time it takes for the device to serve
the request. The sum of Q2I, I2D, and D2C is the total
processing time of an I/O request, which we denote as
Q2C.
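
The interval arithmetic can be sketched as follows. The function takes the timestamps at which one request entered the Q, I, D, and C states and returns the four intervals; the timestamps are hypothetical values in microseconds, not output parsed from blktrace.

```python
def io_intervals(ts):
    """Compute per-state times from the timestamps of states Q, I, D and C."""
    q2i = ts["I"] - ts["Q"]  # inserting/merging the request onto the queue
    i2d = ts["D"] - ts["I"]  # idling on the queue, waiting to be merged
    d2c = ts["C"] - ts["D"]  # being served by the device
    return {"Q2I": q2i, "I2D": i2d, "D2C": d2c, "Q2C": q2i + i2d + d2c}

# Hypothetical timestamps (microseconds) for a single I/O request.
timestamps = {"Q": 1_000, "I": 1_020, "D": 1_500, "C": 4_500}

intervals = io_intervals(timestamps)
print(intervals)
```

Because the three intervals are consecutive, Q2C is also simply the C timestamp minus the Q timestamp.
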


4.3.1 Sequential Read Workload 


As mentioned in the experimental setup, the logical 
block device of the guest VM can be represented as ei- 
ther a flat file or a physical raw disk partition at the hy- 
pervisor level. However, the different representation of 
the guest VM’s block device directly affects the num- 
ber of I/O requests served at the hypervisor level. For 
the selected combinations of Ext3/JFS and Ext3/BD, as
Figure 9 shows, the number of I/O requests served at the
hypervisor’s block layer is significantly lower than that at
the guest’s block layer. More specifically, if JFS is used
as a host file system, it greatly reduces the number of
queued I/O requests sent from the guest level, resulting
in far fewer I/O requests served at the hypervisor level
than at the guest level. If a raw disk partition is
used instead, although there is no reduction in the number
of queued I/O requests, the hypervisor level’s block
layer also lowers the number of served I/O requests by
merging queued I/O requests.

There are two root causes for these I/O behaviors: 
(1) the file prefetching technique at the hypervisor level, 
known as readahead, and (2) the merging activities at the 
hypervisor level introduced by the I/O scheduler. The de- 
tailed descriptions of these root causes are given below. 

First, there are frequent accesses to both files’ con- 
tent and metadata in a sequential read dominated work- 
load. To expedite this process, readahead I/O requests 
are issued at the kernel level of both the guest and the hy- 
pervisor. Basically, readahead I/O requests populate the 
page cache with data already read from the block device, 
so that subsequent reads from the accessed files do not 
block on other I/O requests. As a result, it decreases the 
number of accesses to the block device. In particular, at 
the hypervisor level, a host file system issues readahead
requests and attempts to minimize the frequent accesses
on the flat file by caching the subsequently accessed contents
and metadata in the physical memory. Therefore,
the I/Os served at the hypervisor level are far fewer
than those at the guest level.

Figure 9: Disk I/Os under sequential read workload (queued and served I/Os for Ext3/JFS and Ext3/BD at the guest level and the hypervisor level)

Figure 10: Cache hit ratio under sequential read workload.

However, when accessing a raw disk partition, there 
is no readahead. Thus, for sequential workloads, a host 
file system outperforms a raw disk partition due to more 
effective caching. This discrepancy of data caching at the 
hypervisor level is clearly shown in Figure 10. 
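
The effect of readahead on the number of device accesses can be illustrated with a toy model. The sketch below is not the kernel's readahead algorithm (which adapts its window); it assumes a fixed readahead window, an initially empty page cache, and a purely sequential scan, and all names and sizes are illustrative.

```python
def sequential_scan(num_pages, readahead_window):
    """Return (device_requests, cache_hit_ratio) for one sequential scan."""
    cache = set()
    device_requests = 0
    for page in range(num_pages):
        if page not in cache:
            # Cache miss: one device request fetches this page plus the
            # following pages that fall within the readahead window.
            device_requests += 1
            cache.update(range(page, min(page + readahead_window, num_pages)))
    hit_ratio = (num_pages - device_requests) / num_pages
    return device_requests, hit_ratio

no_readahead = sequential_scan(1024, 1)     # every page is a miss
with_readahead = sequential_scan(1024, 32)  # one miss per 32-page window
print(no_readahead, with_readahead)
```

In this model a larger window means fewer device requests and a higher cache hit ratio, which is the qualitative difference between the flat-file and raw-partition cases described above.
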

Second, to optimize I/O requests being served on the 
block device, the hypervisor’s block layer attempts to 
reduce the number of accesses into the block device 
by sorting and merging queued I/O requests. However, 
when many I/O requests are sorted and merged, they 
need to stay longer in the queue than normal. For JFS 
(host file system), as shown in Figure 9, due to the effective
caching, far fewer I/O requests are sent to the
disk, and thus far fewer sorting/merging activities occur
at the I/O queue. However, when a raw partition is
used, many more I/O requests need to be sorted/merged.
The sorting/merging activities cause a higher idle time
(I2D) for I/O requests being served on the block device
than for those on JFS (host file system). This behavior is
depicted in Figure 11 (hypervisor level). 

Remark: When a flat file is used as a guest VM’s logical
block device, sequential read dominated workloads
can take advantage of the readahead at the hypervisor,
achieving effective data caching. In contrast, when a disk
partition is used, there is no readahead or data caching.
Therefore, for all file systems, to gain high I/O performance,
we recommend that cloud administrators select a
flat file over raw partitions for services dominated by sequential
reads.

Figure 11: I/O times under sequential read workload.


4.3.2 Sequential Write Workload 


Our investigation uncovers the root causes of the nested 
file systems’ performance dependency under a sequential 
write workload in two cases: (A) two file system combi- 
nations hold the same host file system, and (B) two com- 
binations hold the same guest file system. The analysis 
detailed below focuses on two principal factors: sensitiv- 
ity of an I/O scheduler and effectiveness of block alloca- 
tion mechanisms. 


A. Different guests (Ext3, JFS) on the same host
(ReiserFS) As shown in Figure 8 (B), we can see that
the I/O performance of Ext3/ReiserFS is much worse
than that of Ext3/BD, while the I/O performance of
JFS/ReiserFS is much better than that of JFS/BD. At the guest
level, we analyze the performance dependency of Ext3
and JFS based on the comparison of their I/O characteristics.
The details of this comparison are shown in Figure 13.

Figure 13 (A) shows that most I/Os issued from Ext3 
and sent to the block layer are well merged at the guest 
level’s I/O scheduler. The effective merging of I/Os sig- 
nificantly reduces the number of I/Os to be served on 
Ext3 (guest). Meanwhile, Figure 13 (B) shows that 99%
of Ext3’s I/Os are small (8 KB), compared with 68% of
JFS’s. Apparently, merging multiple small I/Os incurs
additional overhead. This is because the small requests
have to wait longer in the queue in order to
be merged, thus increasing their idle times. This behavior
is illustrated in Figure 13 (C).

To understand the root cause of the merging that happens on
Ext3 and JFS (guest), we perform a deep analysis by
monitoring every I/O activity issued at the guest level.
What we found is that the block allocation mechanism
causes this performance variation. To minimize disk
seeks, Ext3 issues I/Os to allocate blocks of data on disk
close to each other. The data includes regular file data, its
metadata, and journal logs of metadata. This allocation
scheme causes most I/Os to be back merged. A back merge
denotes that a new request sequentially falls behind
an existing request in order of start sector, as
they are logically adjacent. Note that two I/Os are logically
adjacent when the end sector of one I/O is logically
located next to the begin sector of the other I/O. As we
can see, clustering adjacent I/Os facilitates data access.
However, it requires the issued I/Os to wait
longer in the queue before being processed.

Figure 14: I/O characteristics at hypervisor level: (A) disk I/Os, (B) average I/O time, and (C) disk seeks.
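
The adjacency test behind a back merge can be sketched as follows. The Request type and its fields are hypothetical, not a kernel structure; a request is described by its start sector and its length in sectors, so the sector just past its end is start + sectors. The 512-byte sector size used in the example is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class Request:
    start: int    # first sector of the request
    sectors: int  # number of sectors covered

    @property
    def end(self):
        """Sector immediately after the last sector of the request."""
        return self.start + self.sectors

def can_back_merge(existing, new):
    """True if `new` starts exactly where `existing` ends."""
    return new.start == existing.end

def back_merge(existing, new):
    """Append `new` to the tail of `existing`, yielding one larger request."""
    return Request(existing.start, existing.sectors + new.sectors)

queued = Request(start=1000, sectors=16)    # an 8 KB request with 512-byte sectors
incoming = Request(start=1016, sectors=16)  # logically adjacent, right behind it
merged = back_merge(queued, incoming) if can_back_merge(queued, incoming) else None
print(merged)
```
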

JFS is more efficient than Ext3 in journaling. For regular
file data written to disk, both Ext3 and JFS effectively
coalesce multiple write operations to reduce the
number of I/Os committed to disk. However, for metadata
and journal logs, instead of independently committing
every single concurrent log entry as Ext3 does, JFS requires
multiple concurrent log entries to be coalesced as
one commit. For this reason, as shown in Figure 12, JFS
spends fewer I/Os on journaling, resulting in less performance
degradation.

Remarks: The efficiency provided by the I/O sched- 
uler’s optimization is no longer valid for all nested file 
systems. Since file systems allocate blocks on disk dif- 
ferently, nested file systems have different impacts on 
performance when one particular I/O scheduler is used. 
Therefore, a nested file system should be chosen based 
on the effectiveness of the underlying I/O scheduler’s operations
on its block allocation scheme.


B. Same guest (JFS) on different hosts (ReiserFS, 
XFS) Based on results of sequential writes shown in 
Figure 8 (B), JFS (guest) performs better on ReiserFS 
than on XFS. We analyze I/O activities of these host file 
systems to uncover differences of their block allocation 
mechanisms. The detailed analysis is given below. 

The analysis of I/O activities reveals that the I/O
scheduler processes ReiserFS’ I/Os similarly to those of
XFS. As shown in Figure 14 (A), the numbers of host file
systems’ I/Os that are queued and served are fairly similar
for ReiserFS and XFS. However, Figure 14 (B) shows
that XFS’ I/Os are executed more slowly than those of
ReiserFS. A further analysis is needed to explain this behavior.
In general, file systems allocate blocks on disk
differently, thus resulting in different execution times
for I/Os. For this reason, we perform an analysis of the
disk seeks. Based on the results shown in Figure 14 (C),
we find that long-distance disk seeks on XFS cause high
overhead and reduce its I/O performance. Note that in
Figure 14 (C), the x-axis represents a normalized
seek distance, and 1 denotes the longest seek distance of
the disk head, from one end to the other end of the partition.

Figure 15: Extra data written into disk under the same
workload from JFS (guest).
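
One way to read the normalized seek distance of Figure 14 (C) is the gap between consecutive requests divided by the span of the partition. The sketch below follows that reading; the exact definition used for the figure is not spelled out in the text, so the formula and the sector numbers are assumptions.

```python
def normalized_seek(prev_sector, next_sector, partition_sectors):
    """Seek distance as a fraction of the full span of the partition."""
    # The longest possible seek goes from sector 0 to the last sector.
    longest = partition_sectors - 1
    return abs(next_sector - prev_sector) / longest

PARTITION_SECTORS = 1_000_001  # hypothetical partition size in sectors

print(normalized_seek(0, 1_000_000, PARTITION_SECTORS))      # end to end
print(normalized_seek(400_000, 650_000, PARTITION_SECTORS))  # a quarter of the span
```
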

To understand why one host file system allocates
disk blocks more effectively than another under the
same workload, we analyze the block allocation
mechanisms of ReiserFS and XFS and find that XFS
incurs an overhead because of its multiple journal logging.
The detailed explanation is as follows:

The multiple logging mechanism for metadata also incurs
an overhead on XFS. Basically, XFS is able to record
multiple separate changes that occur on the metadata of a
single file and store them in journal logs. This technique
effectively avoids having to flush such changes to
disk before another new change is logged. However,
every metadata change can range from 256
bytes to 2 KB in size, while the default size of the log
buffer is only 32 KB. Under an intensive write dominated
workload, this small log buffer causes multiple changes
of the file metadata to be logged frequently. As shown
in Figure 15, this repeated logging produces extra data
written to disk, thus resulting in a performance loss.
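
A back-of-the-envelope calculation shows how quickly a 32 KB log buffer fills at the change sizes quoted above. The sketch ignores any per-record headers and assumes the buffer holds nothing but metadata changes, so the counts are rough upper bounds rather than measured values.

```python
LOG_BUFFER_BYTES = 32 * 1024  # default log buffer size quoted in the text
MIN_CHANGE_BYTES = 256        # smallest metadata change
MAX_CHANGE_BYTES = 2 * 1024   # largest metadata change

# Number of changes that fit before the buffer is full and must be written out.
most_changes = LOG_BUFFER_BYTES // MIN_CHANGE_BYTES
fewest_changes = LOG_BUFFER_BYTES // MAX_CHANGE_BYTES

print(fewest_changes, most_changes)
```

Under these assumptions the buffer fills after between 16 and 128 metadata changes, which is consistent with frequent log writes under a write-intensive workload.
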

Remarks: (1) An effective block allocation of one 
particular file system no longer guarantees a high per- 
formance when it runs on top of another file system. (2) 
Under an intensive write dominated workload, updates
of journal logs on disk should be carefully considered to
avoid performance degradation. Especially for XFS, the
majority of its performance loss is attributable not only to the
placement of journal logs, but also to the technique used to handle
updates of these logs.


5 Discussion 


Despite various practical benefits in using nested file systems
in a virtualized environment, our experiments have
shown the associated performance overhead to be significant
if the file systems are not configured properly. Here we offer
five pieces of advice on choosing the right guest/host file
system configurations to minimize performance degradation,
or in some cases, even improve performance.

Figure 16: (hypervisor level) Extra data written into disk
under a write-dominated workload from guest VM.


Advice 1 For workloads that are read-dominated (both 
sequential and random), using nested file systems has 
minimal impact on I/O throughput, independent of guest 
and host file systems. For workloads that have a signifi- 
cant amount of sequential reads, nested file systems can 
even improve throughput due to the readahead mecha- 
nism at the host level. 


Advice 2 On the other hand, for workloads that are
write-dominated, one should avoid using nested file systems
in general due to (i) one more layer to pass through
and (ii) additional metadata update operations. If one
must use nested file systems, journaled file systems in the
host should be avoided. Journaling of both metadata and
data can cause significant performance degradation and is
therefore not practical for most workloads. If
only metadata is journaled, a crash can easily corrupt a VM
image file, so the metadata-only journaling mode
offers no benefit in the host. As shown in Figure 16, the
additional metadata writes to the journal log can result in
significantly more I/O traffic. Performance is impacted even
more if the log is placed far away from
either the metadata or the data locations.


Advice 3 For workloads that are sensitive to I/O la- 
tency, one should also avoid using nested file systems. 
As shown in Figure 6, even in the best case scenarios, 
nested file systems could increase I/O latency by 10-30% 
due to having an additional layer of file system to traverse 
and one more I/O queue to wait for. 


Advice 4 In a nested file system, data and metadata 
placement decisions are made twice, first in the guest file 
system and then in the host file system. The guest file system
uses various temporal and spatial heuristics to place re- 
lated metadata and data blocks close to each other. How- 
ever, when these placement decisions reach the host file 
system, it can no longer differentiate between data and 
metadata and treats everything as data. As a result, the 
secondary data placement decisions made by a host file 
system are both unnecessary and less efficient than those 
made by a guest file system. Ideally, the host file sys- 
tem should simply act as a pass-through layer such as 
VirtFS [22]. 


FAST ’12: 10th USENIX Conference on File and Storage Technologies




Advice 5 In our experiments, we used the default set of
formatting and mounting parameters in all the file sys- 
tems. However, just like in a non-virtualized environ- 
ment, these parameters can be tuned to improve perfor- 
mance. There are more benefits in tuning the host file 
system’s parameters than guest’s as it is ultimately the 
layer that communicates with the storage device. 

One should tune these parameters in such a way that the
host file system most resembles a “dumb” disk. For example,
when a disk is instructed to read a small disk
block, it will actually read the entire track or cylinder 
and keep them in its internal cache to minimize mechan- 
ical movement for future I/O requests. A host file system 
can emulate this behavior by using larger block sizes. 

Metadata operations at the host file system are another
source of overhead. When a VM image file is accessed
or modified, its metadata often has to be modified, thus
causing additional I/O load. Parameters such as noatime
and nodiratime can be used to avoid updating the
last access time without losing any useful information.
However, when the image file is modified, there is no 
option to avoid updating the metadata. As the image file 
will stay constant in size and ownership, the only field in 
the metadata that needs to be updated is the last modi- 
fied time, which for an image file is just pure overhead. 
Perhaps this can be implemented as a file system mount 
option. Note that journaling in the metadata-only mode,
as mentioned previously, is of very little use at the host
level.
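
As a concrete illustration of the access-time options, the sketch below formats an /etc/fstab entry for the file system that holds the VM images. noatime and nodiratime are standard Linux mount options; the device path, mount point, and file system type are hypothetical placeholders.

```python
def fstab_entry(device, mount_point, fs_type, options):
    """Format one /etc/fstab line: device, mount point, type, options, dump, pass."""
    return f"{device} {mount_point} {fs_type} {','.join(options)} 0 2"

entry = fstab_entry(
    device="/dev/sdb1",             # hypothetical partition holding VM images
    mount_point="/var/lib/images",  # hypothetical mount point
    fs_type="ext3",                 # hypothetical host file system
    options=["defaults", "noatime", "nodiratime"],
)
print(entry)
```
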

Lastly, using more advanced file system features to 
configure block groups and B+ trees to perform intelli- 
gent data allocation and balancing tasks will most likely 
be counter-productive. This is because these features will
cause the guest file system’s view of the disk layout to deviate
further from reality.


6 Conclusion 


Our main objective is to better understand performance 
implications when file systems are nested in a virtual- 
ized environment. The major finding is that the choice 
of nested file systems at both the hypervisor and guest levels
has a significant impact on I/O performance.
Traditionally, a guest file system is chosen based
on the anticipated workload, regardless of the host file 
system. By examining a large set of different combina- 
tions of host and guest file systems under various work- 
loads, we have demonstrated the significant dependency 
of the two layers on performance, and hence, system ad- 
ministrators must be careful in choosing both file systems 
in order to reap the greatest benefit from virtualization. 
In particular, if workloads are sensitive to I/O latency, 
nested file systems should be avoided or host file systems
should simply perform as a pass-through layer in
certain cases.

The intricate interactions between host and guest file 
systems represent an exciting and challenging optimiza- 
tion space for improving I/O performance in virtualized 
environments. Our preliminary investigation on nested 
file systems will help researchers to better understand 
critical performance issues in this area, and shed light on 
finding more efficient methods for utilizing virtual storage.
We hope that our work will motivate system designers
to more carefully analyze the performance gap at the
real and virtual boundaries.

Appendix


We have conducted experiments with the database work- 
load to verify if the I/O performance of nested file sys- 
tems is hypervisor-dependent. The chosen hypervisors 
are architecturally akin to KVM, such as VMware Player 
3.1.4 with guest tools [9], and Xen 4.0 with Xen para- 
virtualized device drivers [12]. Figure 17 shows that the 
I/O performance variations of guest file systems on Xen 
and VMware are fairly similar to those on KVM.

Abstract


Modern file systems use ordering points to maintain con- 
sistency in the face of system crashes. However, such 
ordering leads to lower performance, higher complexity, 
and a strong and perhaps naive dependence on lower lay- 
ers to correctly enforce the ordering of writes. In this 
paper, we introduce the No-Order File System (NoFS),
a simple, lightweight file system that employs a novel 
technique called backpointer-based consistency to pro- 
vide crash consistency without ordering writes as they go 
to disk. We utilize a formal model to prove that NoFS 
provides data consistency in the event of system crashes; 
we show through experiments that NoFS is robust to such 
crashes, and delivers excellent performance across a range 
of workloads. Backpointer-based consistency thus allows 
NoFS to provide crash consistency without resorting to 
the heavyweight machinery of traditional approaches. 


1 Introduction 


One of the core problems in file systems research over the 
years has been the challenge of providing consistency in 
the presence of system crashes. There have been a num- 
ber of solutions to tackle this problem: from the simple 
file-system check [20] of the Fast File System [18] to the 
complicated copy-on-write mechanism of ZFS [3]. Each 
approach has a different core technique: write-ahead log- 
ging [12], copy-on-write [15] or tracking dependencies 
among writes to disk [10]. 

Although these approaches all differ vastly in their de- 
tails, they share one common trait: each uses a careful 
ordering of writes to implement its update protocol. Jour- 
naling file systems require that metadata and data are per- 
sisted before the commit record is written [2, 31, 41, 45]. 
Copy-on-write file systems require that the root block be 
updated only after the rest of the update is safely on disk 
[15, 32, 40, 48]. Soft updates is built entirely around the 
careful ordering of disk writes [10]. 

In the event of a crash, ordering points allow the file 
system to reason about which writes reached the disk and 
which did not, enabling the file system to take correc- 
tive measures, such as replaying the writes, to recover. 
Unfortunately, ordering points are not without their own 
set of problems. By their very nature, ordering points 
introduce waiting into the file-system code, thus poten- 
tially lowering performance. They constrain the schedul- 
ing of disk writes, both at the operating system level and 
at the disk driver level. They introduce complexity into 
the file-system code, which leads to bugs and lower re- 
liability [25, 26, 49, 50]. The use of ordering points also 
forces file systems to ignore the end-to-end argument [34], 
as the support of lower-level systems and disk firmware 
is required to implement imperatives such as the disk 
cache flush. When such imperatives are not properly im- 
plemented [36], file-system consistency is compromised 
[29]. In today’s cloud computing environment [1], the 
operating system runs on top of a tall stack of virtual de- 
vices, and only one of them needs to neglect to enforce 
write ordering [47] for file-system consistency to fail. 

We can thus summarize the current state of the art in 
file-system crash consistency as follows. At one extreme 
is a lazy, optimistic approach that writes blocks to disks in 
any order (e.g., ext2 [4]); this technique does not add over- 
head or induce extra delays at run-time, but requires an ex- 
pensive (and often prohibitive) disk scan after a crash. At 
the other extreme are eager, pessimistic approaches that 
carefully order disk writes (e.g., ZFS or ext3); these tech- 
niques pay a perpetual performance penalty in return for 
consistency guarantees and quick recovery. We seek to 
obtain the best of both worlds: the simplicity and perfor- 
mance benefits of the lazy approach with the strong con- 
sistency and availability of eager file systems. 

We present the No-Order file system (NoFS), a simple, 
optimistic, lightweight file system which maintains con- 
sistency without resorting to the use of ordering. NoFS 
employs a new approach to providing consistency called 
backpointer-based consistency, which is built upon refer- 
ences in each file-system object to the files or directories 
that own it. We extend a logical framework for file sys- 
tems [38] to prove that the incorporation of backpointer- 
based consistency in an order-less file system guarantees a 
certain level of consistency. We simplify the update proto- 
col through non-persistent allocation structures, reducing 
the number of blocks that need to reach disk to success- 
fully complete an operation. 

Through reliability experiments, we demonstrate that 
NoFS is able to detect and handle a wide range of incon- 
sistencies. We compare the performance of NoFS with 
ext2, an order-less file system with no consistency guar- 
antees, and ext3, a journaling file system with metadata 
consistency. We show that NoFS has excellent perfor- 
mance overall, matching or exceeding the performance of 
ext2 and ext3 on various workloads. We also discuss the 
limitations of our approach. 
2 Background 


File systems use a number of data structures to keep track 
of the data on disk. These include allocation structures 
such as bitmaps, and metadata such as inodes. In order to 
do a single operation such as file creation, multiple data 
structures have to be updated on disk. For example, in the 
ext2 file system [4], in order to create an empty file, the 
inode bitmap, the parent inode, the parent directory, and 
the child inode all need to be updated and written to disk. 

The problem of file-system consistency arises because 
the system may crash at any time, resulting in some of 
the updates persisting, and other updates being lost. File- 
system inconsistency manifests in different ways: a miss- 
ing file, a file with garbage data, or in some cases, an un- 
mountable file system. File systems have different solu- 
tions to this problem, with varying levels of consistency. 

We first examine the different levels of consistency pro- 
vided by file systems, describing the guarantees provided 
by each level. We then examine the techniques used in file 
systems to provide consistency and show that all of them 
(except the file-system check) have at least one ordering 
point in their update protocols. We discuss the disadvan- 
tages of having ordering points and motivate the design of 
our order-less file system. 


2.1 File-system consistency 

There are many levels of consistency in file systems, dif- 
fering in terms of guarantees provided for data and meta- 
data blocks. An inconsistency could be caused by many 
things: a hardware error, memory corruption, or a system 
crash. In this work, we are only concerned with inconsis- 
tencies occurring due to a system crash. 

Metadata consistency: The metadata structures of the 
file system are entirely consistent with each other. There 
are no dangling files and no duplicate pointers. The coun- 
ters and bitmaps of the file system, which keep track of 
resource usage, match with the actual usage of resources 
on the disk. Therefore a resource is in use if and only 
if the bitmaps say that it is in use. Metadata consistency 
does not provide any guarantees about data. 

Data consistency: Data consistency is a stronger form 
of metadata consistency. Along with the guarantee about 
metadata, there is the additional guarantee that all data 
that is read by a file belongs to that file. In other words, a 
read of file A may not return garbage data, or data belong- 
ing to some file B. It is possible that the read may return 
an older version of the data of file A. 

Version consistency: Version consistency is a stronger 
form of data consistency with the additional guarantee 
that the metadata version matches the version of the re- 
ferred data. For example, consider a file with a single data 
block. The data block is overwritten, and a new block is 
added, thereby changing the file version: the old version 
had one block, and the new version has two blocks. Ver- 
sion consistency guarantees that a read of the file does not 
return old data from the first block and new data from the 
second block (since the read would return the old version 
of the data block and the new version of the file metadata). 


2.2 Techniques for providing consistency 

In this section, we review different approaches to provid- 
ing consistency in file systems. We point out where order- 
ing points are needed in each of the techniques, except for 
file-system checks. An ordering point signifies that some 
blocks need to be persistent on disk before other blocks. 
For example, an update protocol might require that all the 
file-system metadata reach the disk before all the data. 


2.2.1 File-system check 

The file-system check is the simplest solution to the con- 
sistency problem: let the system crash and become in- 
consistent, and upon reboot, fix the inconsistencies. This 
technique was used in the Fast File System [18, 20] and 
the ext2 file system [4]. No extra actions are required dur- 
ing runtime, allowing the file system to execute without 
any performance degradation. The simplicity comes with 
a high cost: the entire disk needs to be scanned before in- 
consistencies can be fixed in the file system. While this 
was acceptable for early file systems that were megabytes 
in size, scanning an entire disk (or worse, a large RAID 
volume [23]) would require hours in modern systems. 
Though several optimizations were developed to reduce 
the running time of the file-system check [13, 19, 24], it is 
still too expensive for large volumes, prompting the file- 
system community to turn to other solutions. 

File systems that depend upon the file-system check 
alone for consistency cannot provide data consistency, 
since there is no way for the file system to differentiate 
between valid data and garbage in a data block. Therefore 
file reads may return garbage after a crash. The state of 
every metadata structure is known after the disk scan, and 
hence duplicate resource allocation and orphan resources 
can be handled, ensuring metadata consistency. 


2.2.2 Journaling 

Journaling uses the idea of write-ahead logging [12] to 
solve the consistency problem: metadata (and sometimes 
data) is first logged to a separate location on disk, and 
when all writes have safely reached the disk, the infor- 
mation is written into its original place in the file system. 
Over the years, this technique has been incorporated into 
a number of file systems such as NTFS [21], JFS [2], XFS 
[41], ReiserFS [31], and ext3 [45, 46]. 

Journaling file systems offer data or metadata consis- 
tency based on whether data is journaled or not. Both 
journaling modes use at least one ordering point in their 
update protocols, where they wait for the journal writes 
to be persisted on disk before writing the commit block. 
Journaling file systems often perform worse than their 
order-less peers, since information needs to be first writ- 
ten to the log and then later to the correct location on disk. 
Recovery of the journal is needed after a crash, but it is 
usually much faster than the file-system check. 
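
To make the ordering point concrete, the following is a minimal sketch of a journaled update against a toy disk model. The ToyDisk class, its flush method, and the key layout are assumptions made purely for illustration and are not taken from any particular journaling file system; flush stands in for a disk cache flush.

```python
class ToyDisk:
    """Hypothetical disk with a volatile write cache (illustration only)."""

    def __init__(self):
        self.cache = {}    # writes acknowledged but not yet durable
        self.durable = {}  # writes that survive a crash

    def write(self, key, value):
        self.cache[key] = value

    def flush(self):
        """Force all cached writes to stable storage."""
        self.durable.update(self.cache)
        self.cache.clear()

    def crash(self):
        """Lose every write that was never flushed."""
        self.cache.clear()


def journaled_update(disk, txn_id, blocks):
    """Apply {address: data} through a write-ahead journal."""
    for i, (addr, data) in enumerate(sorted(blocks.items())):
        disk.write(("journal", txn_id, i), (addr, data))
    disk.flush()  # ordering point: journal blocks must precede the commit
    disk.write(("commit", txn_id), True)
    disk.flush()  # the commit must be durable before in-place writes
    for addr, data in blocks.items():
        disk.write(("home", addr), data)  # checkpoint; may reach disk lazily


def recover(disk):
    """Replay the journal entries of every committed transaction."""
    committed = {key[1] for key in disk.durable if key[0] == "commit"}
    entries = sorted(
        (key, value)
        for key, value in disk.durable.items()
        if key[0] == "journal" and key[1] in committed
    )
    for _, (addr, data) in entries:
        disk.durable[("home", addr)] = data
```

In this model, a crash after the commit record is durable can be repaired by replay, whereas journal entries without a commit record are ignored. Both calls to flush are points at which the file system must wait for the disk.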


2.2.3 Soft updates 
Soft updates involves tracking dependencies among in- 
memory copies of metadata blocks, and carefully order- 
ing the writes to disk such that the disk always sees con- 
tent that is consistent with the other disk metadata. In 
order to do this, it may sometimes be necessary to roll 
back updates to a block at the time of write, and roll- 
forward the update later. Soft updates was implemented 
for FFS, and enabled FFS to achieve performance close 
to that of a memory-based file system [10]. However, 
it was extremely tricky to implement the ordering rules 
correctly, leading to numerous bugs. Though the Feather- 
stitch project [9] reduces the complexity of soft updates, 
the idea has not spread beyond the BSD distributions. 
Soft updates provide metadata and data consistency at 
low cost. FFS with soft updates cannot tell the differ- 
ence between different versions of data, and hence does 
not provide version consistency. Soft updates also pro- 
vide high availability since a blocking file-system check 
is not required; instead, upon reboot after a crash, a snap- 
shot of the file-system state is taken, and the file-system 
check is run on the snapshot in the background [19]. 


2.2.4 Copy-on-write 

The copy-on-write technique, as the name suggests, di- 
rects a write to a metadata or data block to a new copy of 
the block, never overwriting the block in place. Once the 
write is persisted on disk, the new information is added 
to the file-system tree. The ordering point lies between 
these two steps, where the file system atomically switches 
from the old view of the metadata to one that includes 
the new information. Copy-on-write has been used 
in a number of file systems [15, 32], with the most recent 
being ZFS [3] and btrfs [48]. 

Copy-on-write file systems provide metadata, data, and 
version consistency due to the use of logging and trans- 
actions. Modern copy-on-write file systems like ZFS 
achieve good performance, though at the cost of very high 
complexity. The large size of these file systems (tens of 
thousands of lines of code [35]) is partly due to the copy- 
on-write technique, and partly due to advanced features 
such as storage pools and snapshots. 


2.3 Summary 

Table 1 compares consistency techniques on complexity, 
performance, availability, and consistency guarantees pro- 
vided. Observe that every technique that provides consis- 
tency and availability in file systems uses ordering points 
in its update protocol. Ordering points lead to complexity 
in the file-system code, paving the way for bugs and de- 
creased reliability. File systems which use ordering points 
perform worse than order-less file systems on some work- 
loads. The use of ordering points is built upon lower- 
level functionality such as the SATA flush command [43]; 
when disks do not reliably flush their cache [36], ordering 
points fail to enforce consistency and more complicated 
measures have to be taken [29]. Thus there is a need for a 
technique which provides consistency without sacrificing 
simplicity, availability, or performance. We believe that 
backpointer-based consistency fulfills this need. 


[Table 1 body not recovered from the source. Rows: File-system check, Metadata journaling, Data journaling, Soft Updates, Copy-on-write, BBC. Columns, as named in the text: consistency, performance, availability, complexity.] 

Table 1: Consistency techniques. The table compares var- 
ious approaches to providing consistency in file systems. Leg- 
end: L — Low, M — Medium, H — High. We observe that only 
backpointer-based consistency (BBC) provides data consistency 
with low complexity, high performance, and high availability. 


3 Design 


We present the design of the No-Order file system (NoFS), 
a lightweight, consistent file system with no ordering 
points in its update protocol. NoFS provides access to 
files immediately upon mounting, with no need for a file- 
system check or journal recovery. 

In this section, we introduce backpointer-based consis- 
tency (BBC), the technique used in NoFS for maintaining 
consistency. We use a logical framework to prove that 
BBC provides data consistency in NoFS. We discuss how 
BBC can be used to detect and recover from inconsisten- 
cies, and elaborate on why allocation structures are not 
persisted to disk in NoFS. 


3.1 Overview 
The main challenge in NoFS is maintaining consistency 
without ordering points. Consistency is closely tied to 
logical identity in file systems. Inconsistencies arise due 
to confusion about an object’s identity; for example, two 
files may each claim to own a data block. If the block’s 
true owner is known, such inconsistencies could be re- 
solved. Associating each object with its logical identity is 
the crux of the backpointer-based consistency technique. 
Employing backpointer-based consistency allows 
NoFS to detect inconsistencies on-the-fly, upon user 
access to corrupt files and directories. The presence of 
a corrupt file does not affect access to other files in any 
way. This property enables immediate access to files 
upon mounting, avoiding the downtime of a file-system 
check or journal recovery. A read is guaranteed to never 
return garbage data, though stale data may be returned. 

We intentionally avoided using complex rules and de- 
pendencies in NoFS. We simplified the update protocols, 
not persisting allocation structures to disk. We maintain 
in-memory versions of allocation structures and discover 
data and metadata allocation information in the back- 
ground while the file system is running. 


3.2 Backpointer-based consistency 
Backpointer-based consistency is built around the logical 
identity of file-system objects. The logical identity of a 
data block is the file it belongs to, along with its position 
inside the file. The logical identity of a file is the list of 
directories that it is linked to. This information is em- 
bedded inside each object in the form of a backpointer. 
Upon examining the backpointer of an object, the parent 
file or directory can be determined instantly. Blocks have 
only one owner, while files are allowed to have multiple 
parents. Figure 1 illustrates how backpointers link file- 
system objects in NoFS. As each object in the file system 
is examined, a consistent view of the file-system state can 
be incrementally built up. 

Though conceptually simple, backpointers allow detec- 
tion of a wide range of inconsistencies. Consider a block 
that is deleted from a file, and then assigned to another 
file and overwritten. If a crash happens at any point dur- 
ing these operations, some subset of the data structures on 
disk may not be updated, and both files may contain point- 
ers to the block. However, by examining the backpointer 
of the block, the true owner of the block can be identified. 
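
The reallocation scenario above can be sketched in a few lines. This is an illustrative model rather than the NoFS implementation: the Block class and the true_owner function are hypothetical names, and each claim is modeled simply as the (inode number, offset) pair of a file that points at the block.

```python
from dataclasses import dataclass


@dataclass
class Block:
    owner_inode: int  # backpointer: inode number of the owning file
    offset: int       # backpointer: logical block offset within that file
    data: bytes


def true_owner(block, claims):
    """Return the inode whose claim matches the block's backpointer.

    claims is an iterable of (inode number, offset) pairs, one per file
    that currently points at the block. Returns None if no claim matches.
    """
    for inode_no, offset in claims:
        if (inode_no, offset) == (block.owner_inode, block.offset):
            return inode_no
    return None
```

For instance, if the block was freed from file 3 and rewritten by file 7 before a crash, both files may still point to it, but only the claim from file 7 matches the backpointer stored with the data.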

In designing NoFS, we assume that the write of a block 
along with its backpointer is atomic. This assumption is 
key to our design, as we infer the owner of the data block 
by examining the backpointer. Current SCSI drives allow 
a 520-byte atomic write to enable checksums along with 
each 512-byte sector [42]; we envision that future drives 
with 4-KB blocks will provide similar functionality. 

Backpointers are similar to checksums in that they ver- 
ify that the block pointed to by the inode actually belongs 
to the inode. However, a checksum does not identify the 
owner of a data block; it can only confirm that the cor- 
rect block is being pointed to. Consistency and recovery 
require identification of the owner. 


3.2.1 Intuition 
We briefly provide some intuition about the correctness of 
using the backpointer-based consistency technique to en- 
sure data consistency. We first consider what data consis- 
tency and version consistency mean, and the file-system 
structures required to ensure each level of consistency. 
Data consistency provides the guarantee that all the 
data accessed by a file belongs to that file; it may not 
be garbage data or belong to another file. This guarantee 
is obtained when a backpointer is added to a data block. 
Figure 1: Backpointers. The figure shows a conceptual view 
of the backpointers present in NoFS. The file has a backpointer 
to the directory that it belongs to. The data block has a back- 
pointer to the file it belongs to. Files and directories have many 
backpointers while data blocks have a single backpointer. 


Consider a file pointing to a data block. Upon reading 
the data block, the backpointer is examined. If the back- 
pointer matches the file, then the data block must have 
belonged to the file, since the backpointer and the data 
inside the block were written together. If the data block 
was reallocated to another file and written, it would be re- 
flected in the backpointer. Hence, no ordering is required 
between writes to data and metadata since the data block’s 
backpointer would disagree in the event of a crash. Note 
that the data block could have belonged to the file at some 
point in the past; the backpointer does not provide any in- 
formation about when the data block belonged to the file. 
Thus, the file might be pointing to an old version of the 
data block, which is allowed under data consistency. 

Version consistency is a stricter form of data consis- 
tency which requires that in addition to belonging to the 
correct file, all accessed data must be the correct version. 
Stale data is not allowed in this model. Backpointers 
are not sufficient to enforce version consistency, as they 
contain no information about the version of a data block. 
Hence more information needs to be added to the file sys- 
tem. Each data block has a timestamp indicating when it 
was last updated. This timestamp is also stored in the in- 
ode containing the data block. When a block is accessed, 
the timestamp in the inode and data block must match. 
Since timestamps are a way to track versions, the versions 
in the inode and data block can be verified to be the same, 
thereby providing version consistency. 
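
As a rough illustration of the two checks described above, the sketch below contrasts a data-consistency check, which uses only the backpointer, with a version-consistency check, which additionally compares timestamps. The DataBlock class and both function names are hypothetical, and the integer timestamps are an assumption made for brevity.

```python
from dataclasses import dataclass


@dataclass
class DataBlock:
    owner_inode: int  # backpointer: owning inode number
    offset: int       # backpointer: logical offset within the file
    timestamp: int    # time of last update, stored with the block


def data_consistent(inode_no, offset, block):
    """The block belongs to this file at this offset (may be stale)."""
    return (block.owner_inode, block.offset) == (inode_no, offset)


def version_consistent(inode_no, offset, inode_timestamp, block):
    """The block belongs to the file and matches the inode's version."""
    return (data_consistent(inode_no, offset, block)
            and block.timestamp == inode_timestamp)
```

A block whose backpointer matches but whose timestamp is older than the one recorded in the inode passes the first check and fails the second; this is exactly the stale-data case that data consistency permits.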

We decided against including timestamps in NoFS 
backpointers because updating timestamps in backpoint- 
ers and metadata reduces performance and induces a con- 
siderable amount of storage overhead. Timestamps need 
to be stored with every object and its parent. Every up- 
date to an object involves an update to the parent object, 
the parent’s parent, and so on all the way up to the root. 
Furthermore, doing so works against our goal of keeping 
the file system simple and lightweight; hence, NoFS pro- 
vides data consistency, but not version consistency. 

The full proof involves extending the logical framework 
of Sivathanu et al. [38] to prove that an order-less file sys- 
tem employing the backpointer-based consistency tech- 
nique provides data consistency. We further prove that 
if the backpointer contains an update timestamp, the file 
system provides version consistency. The full proof can 
be found in the technical report [5]. 


3.2.2 Detection and Recovery 

In NoFS, detection of an inconsistency happens upon ac- 
cess to corrupt files or data. When a data or metadata 
block is accessed, the backpointer is checked to verify that 
the parent metadata block has the same information. If a 
file is not accessed, its backpointer is not checked, which 
is why the presence of corrupt files does not affect access 
to other files: checking is performed on-demand. 

This checking happens both at the file level and the data 
block level. When a file is accessed, it is checked to see 
whether it has a backpointer to its parent directory. This 
check allows identification of deleted files where the di- 
rectory did not get updated, and files which have not been 
properly updated on disk. 

NoFS is able to recover from inconsistencies by treating 
the backpointer as the true source of information. When 
a directory and a file disagree on whether the file belongs 
to the directory or not, the backpointer in the file is exam- 
ined. If the backpointer to the directory is not found, the 
file is deleted from the directory. Issues involving blocks 
belonging to files are similarly handled. 
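
The recovery rule for a directory and a file that disagree might be sketched as follows. The dictionary-based model and the name lookup_and_repair are assumptions for illustration only; the actual checks operate on in-kernel inode and directory structures.

```python
def lookup_and_repair(dir_ino, entries, name, backlinks):
    """Resolve name in a directory, trusting the child's backlinks.

    entries maps names to child inode numbers for directory dir_ino.
    backlinks maps each inode number to the set of parent directories
    recorded in that inode. Returns the child inode number, or None if
    the entry is missing or was removed as inconsistent.
    """
    child = entries.get(name)
    if child is None:
        return None
    if dir_ino not in backlinks.get(child, set()):
        # The backpointer is the source of truth: the directory entry
        # is stale, so it is deleted from the directory.
        del entries[name]
        return None
    return child
```

Because the check runs only when a name is looked up, an inconsistent entry elsewhere in the directory tree costs nothing until it is accessed.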


3.3 Non-persistent allocation structures 

In an order-less file system, allocation structures like 
bitmaps cannot be trusted after a crash, as it is not known 
which updates were applied to the allocation structures on 
disk at the time of the crash. Any allocation structure will 
need to be verified before it can be used. In the case of 
global allocation structures, all of the data and metadata 
referenced by the structure will need to be examined to 
verify the allocation structure. 

Due to these complexities, we have simplified the up- 
date protocols in NoFS, making the allocation structures 
non-persistent. The allocation structures are kept entirely 
in-memory. NoFS starts out with empty allocation struc- 
tures and allocation information is discovered in the back- 
ground, while the file system is online. NoFS can verify 
whether a block is in use by checking the file that it has a 
backpointer to; if the file refers to the data block, the data 
block is considered to be in use. Similarly, NoFS can ver- 
ify whether a file exists or not by checking the directories 
in its backpointers. Thus NoFS can incrementally learn 
allocation information about files and blocks. 
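
One way to picture this incremental discovery is the sketch below, which rebuilds an in-memory set of in-use blocks from backpointers. The data model (dictionaries keyed by block and inode number) and the function names are hypothetical; the background scan in NoFS works on on-disk structures and is not shown in this form.

```python
def block_in_use(block_no, backpointers, inode_pointers):
    """A block is in use only if the file it points back to refers to it.

    backpointers maps block numbers to (inode number, offset) pairs.
    inode_pointers maps inode numbers to {offset: block number}.
    """
    backpointer = backpointers.get(block_no)
    if backpointer is None:
        return False
    inode_no, offset = backpointer
    return inode_pointers.get(inode_no, {}).get(offset) == block_no


def discover_allocation(block_numbers, backpointers, inode_pointers):
    """Start from an empty allocation set and fill it in by scanning."""
    in_use = set()
    for block_no in block_numbers:
        if block_in_use(block_no, backpointers, inode_pointers):
            in_use.add(block_no)
    return in_use
```

A block whose backpointer names a file that no longer refers to it is treated as free, so nothing about allocation needs to be trusted from before the crash.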


4 Implementation 


We now present the implementation of NoFS. We first de- 
scribe the operating system environment, and then discuss 
the implementation of the two main components of NoFS: 
backpointers and non-persistent allocation structures. We 
describe the backpointer operations that NoFS performs 
for each file-system operation. 


Action   | Backpointer operations 
Create   | Write backlink into new inode 
Read     | Translate offset 
         | Verify block backpointer in data block 
Write    | Translate offset 
         | Verify block backpointer in data block 
Append   | Translate offset 
         | Write block backpointer into data block 
Truncate | No backpointer operations 
Delete   | No backpointer operations 
Link     | Write backlink into inode 
Unlink   | Remove backlink from inode 
mkdir    | Write directory entry backpointer into directory block 
rmdir    | No backpointer operations 


Table 2: NoFS backpointer operations. The table lists 
the operations on backpointers caused by common file system 
operations. Note that all checks are done in memory. 


4.1 Operating system environment 
NoFS is implemented as a loadable kernel module in- 
side Linux 2.6.27.55. We developed NoFS based on ext2 
file-system code. Since NoFS involves changes to the 
file-system layout, we modified the e2fsprogs tools 
1.41.14 [44] used for creating the file system. 

Linux file systems cache user data in a unified page 
cache [6]. File reads (except direct I/O) are always sat- 
isfied from the page cache. If the page is not up-to-date at 
the time of read, the page is first filled with data from the 
disk and then returned to the user. File writes cause pages 
to become dirty, and an I/O daemon called pdflush pe- 
riodically flushes dirty pages to disk. Due to this tight 
integration between the page cache and the file system, 
NoFS involves modifications to the Linux page cache. 


4.2 Backpointers 

NoFS contains three types of backpointers. We describe 
each of them in turn, pointing out the objects they con- 
ceptually link, and how they are implemented in NoFS. 
Figure 2 illustrates how various objects are linked by dif- 
ferent backpointers. Every file-system operation that in- 
volves the creation or access of a file, directory, or data 
block involves an operation on backpointers. These oper- 
ations are listed in Table 2. 


4.2.1 Block backpointers 

Block backpointers are {inode number, block offset} pairs, 
embedded inside each data block in the file system. The 
first 8 bytes of every data block are reserved for the back- 
pointer. Note that we need to embed the backpointer in- 
side the data block since disks currently do not provide 
the ability to store extra data along with each 4K block 
atomically. The first 4 bytes denote the inode number of 
the file to which the data block belongs. The second 4 
bytes represent the logical block offset of the data block 
within the file. Given this information, it is easy to check 
whether the file contains a pointer to the data block at the 
specified offset. Indirect blocks contain backpointers too, 
since they belong to a particular file. However, since the 
indirect block data is not logically part of a file, they are 
marked with a negative number for the offset. 


Figure 2: Implementation of backpointers. The figure 
shows the different kinds of backpointers present in NoFS. foo is 
a child of the root inode /. This link is represented by a backlink 
from foo to /. Similarly, the data block is a part of foo, and 
hence has a backpointer to foo. Directory blocks also contain 
backpointers, in the form of dot entries to their owner’s inode. 

Our implementation depends on the read and write 
system calls being used; data is modified as it is passed 
from the page cache to the user buffer and back during 
these calls. When these calls are by-passed (via mmap) 
or the page cache itself is by-passed (via direct IO mode), 
verifying each access becomes challenging and expensive. 
We do not support mmap or direct IO mode in NoFS. 

Insertion: The data from a write system call goes 
through the page cache before being written to disk. We 
modified the page cache so that when a page is requested 
for a disk write, the backpointer is written into the page 
first and then returned for writing. The block offset trans- 
lation was modified to take the backpointer into account 
when translating a logical offset into a block number. 

Verification: Once a page is populated with data from 
the disk, the page is checked for the correct backpointer. 
If the check fails, an I/O error is returned. If this is the first 
time that the data block is accessed, the inode’s attributes 
(size and number of blocks) are updated. Note that the 
page is not checked on every access, but only the first time 
that it is read from disk. Assuming memory corruption 
does not occur [51], this level of checking is sufficient. 
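
The layout, insertion, and verification steps above can be approximated in user space as follows. This is a sketch under stated assumptions, not the kernel code: the little-endian byte order, the signedness of the two fields, and all function names are ours, and the real implementation operates on page-cache pages rather than byte strings.

```python
import struct

BLOCK_SIZE = 4096
# Assumed layout: little-endian unsigned inode number, then a signed
# block offset (negative values would mark indirect blocks).
BACKPOINTER = struct.Struct("<Ii")
PAYLOAD_SIZE = BLOCK_SIZE - BACKPOINTER.size  # 4088 usable bytes


def translate_offset(byte_offset):
    """Map a logical file offset to (block offset, byte within payload)."""
    return divmod(byte_offset, PAYLOAD_SIZE)


def insert_backpointer(inode_no, block_offset, payload):
    """Build the on-disk image of a block just before it is written."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload does not fit after the backpointer")
    header = BACKPOINTER.pack(inode_no, block_offset)
    return header + payload.ljust(PAYLOAD_SIZE, b"\0")


def verify_backpointer(raw_block, inode_no, block_offset):
    """Check a block freshly read from disk and return its payload."""
    if BACKPOINTER.unpack_from(raw_block) != (inode_no, block_offset):
        raise OSError("backpointer mismatch")  # surfaced as an I/O error
    return raw_block[BACKPOINTER.size:]
```

Because 8 bytes of each 4-KB block hold the backpointer, a logical offset of 4096 falls 8 bytes into the second block rather than at its start, which is why the offset translation must account for the backpointer.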


4.2.2 Directory backpointers 


The dot directory entry serves as the backpointer for di- 
rectory blocks, as it points to the inode which owns the 
block. However, the dot entry is only present in the first 
directory block. We modified ext2 to embed the dot entry 
in every directory block, thus allowing the owner of any 
directory block to be identified using the dot entry. 
Though the block backpointer could have been used in 
directory blocks as well, we did not do so for two reasons. 
First, the structured content of the directory block enables 
the use of the dot entry as the backpointer, simplifying our 
implementation. Second, the offset part of the block back- 
pointer is unnecessary for directory blocks since directory 
blocks are unordered and appending a directory block at 
the end suffices for recovery. 

[Figure 2 legend: data block backpointer; inode backlink] 

FAST '12: 10th USENIX Conference on File and Storage Technologies 

Insertion: When a new directory entry is being added 
to the inode, it is determined whether a new directory 
block will be needed. If so, the dot entry is added in the 
new block, followed by the original directory entry. 

Verification: Whenever the directory block is ac- 
cessed, such as in readdir, the dot entry is cross- 
checked with the inode. If the check fails, an I/O error 
is returned and the directory inode’s attributes (size and 
block count) are updated. 


4.2.3 Backlinks 

An inode’s backlinks contain the inode numbers of all its 
parent directories. Every valid inode must have at least 
one parent. Hard linked inodes may have multiple parents. 

We modified the file-system layout to add space for 
backlinks inside each inode. The inode size is increased 
from the default 128 bytes to 256 bytes, enabling the 
addition of 32 backlinks, each of size 4 bytes. The 
mke2fs tool was modified to create a backlink between 
the lost+found directory and the root directory when 
the file system is created. 

Insertion: When a child inode is linked to a parent di- 
rectory during system calls such as create or link, a 
backlink to the parent is added in the child inode. 

Verification: At each step of the iterative inode lookup 
process, we check that the child inode contains a backlink 
to the parent. A failed check stops the lookup process and 
returns an I/O error. If this is the first time the inode is 
accessed via this particular path, the number of links for 
the inode is updated. 
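
The backlink array and the check performed at each lookup step could be modeled as below. The use of 0 as the free-slot marker, the little-endian packing, and the function names are assumptions for illustration; only the slot count and slot size come from the layout described above.

```python
import struct

MAX_BACKLINKS = 32
BACKLINKS = struct.Struct("<32I")  # 32 slots x 4 bytes = 128 bytes
EMPTY = 0  # assumed free-slot marker


def add_backlink(slots, parent_ino):
    """Record parent_ino in the first free slot (create or link)."""
    try:
        free = slots.index(EMPTY)
    except ValueError:
        raise OSError("no free backlink slot") from None
    slots[free] = parent_ino


def remove_backlink(slots, parent_ino):
    """Clear the slot holding parent_ino (unlink).

    Raises ValueError if the parent is not recorded.
    """
    slots[slots.index(parent_ino)] = EMPTY


def verify_lookup_step(parent_ino, child_slots):
    """Fail the lookup if the child has no backlink to the parent."""
    if parent_ino not in child_slots:
        raise OSError("child has no backlink to parent")
```

The 128 bytes occupied by the array account for the growth of the inode from 128 to 256 bytes, and in this model the fixed slot count caps the number of parent directories an inode can record at 32.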


4.2.4 Detection 

Every data block is checked for a valid backpointer when 
it is read from the disk into the page cache. We as- 
sume that neither memory nor on-disk corruption hap- 
pens; hence, it is safe to limit checking to when a data 
block is first brought into main memory. It is this property 
that leads to the high performance of NoFS; because disk 
I/O is several orders of magnitude slower than in-memory 
operations, the backpointer check can be performed on 
disk blocks with very low overhead. 

Inode backlink checking occurs during directory path 
resolution. The child inode’s backlink to the parent in- 
ode is checked. Since both inodes are typically in mem- 
ory during directory path resolution, the backlink check 
is a quick in-memory check, and does not degrade perfor- 
mance significantly, since a disk read is not performed to 
obtain the parent or child inode. 

Figure 3: Handling crashes with backpointers. The figure presents three failure scenarios during the rename of a file, and 
the creation of a file with 1 byte of data. In each scenario, employing backpointers allows us to detect inconsistencies such as both 
the old and new parents claiming the child, and the child pointing to a data block that hasn’t been updated. 

Note that the detection of inconsistency happens at the 
level of a single resource, such as an inode or a data 
block. Verifying that a data block belongs to an inode 
can be done without considering any other object in the 
file system. The presence of corrupt files or blocks does 
not affect the reads or writes to other non-corrupt files. 
As long as corrupt blocks are not accessed, their presence 
can be safely ignored by the rest of the system. This fea- 
ture contributes to the high availability of NoFS: a file- 
system check or recovery protocol is not needed upon 
mount. Files can be immediately accessed, and any ac- 
cess of a corrupt file or block will return an error. This 
feature also allows NoFS to handle concurrent writes and 
deletes. Even if many writes and deletes were going on at 
the time of a crash, NoFS can still detect inconsistencies 
by considering each inode and data block pair in isolation. 


Let us illustrate this with an example. Upon mount, we 
run the command cat /dir1/file1, which involves 
several checks in the file system. First, the directory block 
for dir1 is fetched and checked for a directory 
backpointer to the root directory. Similarly, when the 
file1 inode is retrieved from disk, it is checked to see if 
it has a backlink to dir1. When the data block of file1 
is retrieved, it is checked to verify that the data block has 
a block backpointer to file1. If any of these checks fail, 
an error is returned to the user. 


Figure 3 illustrates the detection of inconsistencies dur- 
ing different crash scenarios for two operations: renam- 
ing a file and creating a single byte file. The state of data 
structures in memory before and after the update is first 
shown. In each crash scenario, a different subset of the 
in-memory updates is successfully written to disk. The 
state of various pointers on disk after the crash is shown, 
followed by the consistent logical view that NoFS obtains 
after verification using backpointers. For example, during 
the rename, a crash may lead to the file being listed 
in both the old and new directories. However, the logical 
status shows that upon backpointer verification, the true 
owner of the child inode is found using the backlink. 


4.2.5 Recovery 


Having backlinks and backpointers allows recovery of lost 
files and blocks. Files can be lost for a number of reasons. 
A rename operation consists of an unlink and a link 
operation. An inopportune crash could leave the inode not 
linked to any directory. A crash during the create opera- 
tion could also lead to a lost file. Such a lost file can be 
recovered in NoFS, due to the backlinks inside each in- 
ode. Each such inode is first checked for access to all its 
data blocks. If all the data blocks are valid, it is a valid 
subtree in the file system and can be inserted back into 
the directory hierarchy (using the backlinks information) 
without compromising the consistency of the file system. 
When adding a directory entry for the recovered inode, it 
is correct to append the directory entry at the end of the 
directory, since directory entries are an unordered collec- 
tion; there is no meaning attached to the exact offset inside 
a directory block where a directory entry is added. 


In a similar fashion, it is possible to recover data blocks 
lost due to a crash before the inode is updated. A data 
block, once it has been determined to belong to an inode, 
cannot be embedded at an arbitrary point in the inode data. 
It is for this reason that the offset of a data block is embed- 
ded in the data block, along with the inode number. The 
offset allows a data block to be placed exactly where it 
belongs inside a file. Indirect blocks of a file do not have 
the offset embedded, as they do not have a logical offset 
within the file. Indirect blocks are not required to recon- 
struct a file; only data blocks and their offsets are needed. 
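
To make the role of the embedded offset concrete, here is a minimal sketch of rebuilding a file from blocks found on disk. It assumes each recovered block is available as an (inode number, block offset, payload) tuple; the function name reconstruct_file and the zero-filling of missing offsets are illustrative choices, not details from NoFS.

```python
def reconstruct_file(blocks, inode_no, payload_size=4088):
    """Reassemble a file from (inode_no, offset, payload) tuples found on disk."""
    # Keep only the blocks whose backpointer names this inode, keyed by offset.
    owned = {off: data for ino, off, data in blocks if ino == inode_no}
    out = bytearray()
    for off in range(max(owned, default=-1) + 1):
        # A missing offset is treated as a hole and filled with zeros.
        out += owned.get(off, b"").ljust(payload_size, b"\0")
    return bytes(out)
```

Because placement depends only on the offset stored in each block, the blocks may be discovered in any order and no indirect block is consulted.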

Using reconstruction of files from their blocks on disk, 
files can be potentially “undeleted”, provided that the 
blocks have not been reused for another file. We have not 
implemented undelete in NoFS. Block allocation would 
need to be tweaked to not reuse blocks for a certain 
amount of time, or until a certain free-space threshold is 
reached. Undelete might turn up stale data because NoFS 
does not support version consistency; the data block might 
have been part of an older version of the inode. 


4.3 Non-persistent allocation structures 

The allocation structures in ext2 are bitmaps and group 
descriptors. These structures are not persisted to disk in 
NoFS. In-memory versions of these data structures are 
built using the metadata scanner and data scanner. Statis- 
tics usually maintained in the group descriptors, such as 
the number of free blocks and inodes, are also maintained 
in their in-memory versions. 

Upon file-system mount, in-memory inode and block 
bitmaps are initialized to zero, signifying that every inode 
and data block is free. Since every block and inode has a 
backpointer, it can be determined to be in use by examin- 
ing its backlink or backpointer, and cross-checking with 
the inode mentioned in the backpointer. As every object 
is examined, consistent file-system state is built up and 
eventually complete knowledge of the system is achieved. 

In the file system, a block or inode that is marked free 
could mean two things: it is free, or it has not been ex- 
amined yet. Since all blocks and inodes are marked free 
at mount time, inodes need to be examined to check that 
they are indeed free; hence blocks or inodes that have not 
been examined yet cannot be allocated. In order to mark 
which inodes or blocks have been examined, we added a 
new bitmap each for inodes and data blocks called the va- 
lidity bitmap. If a block or inode has been examined and 
marked as free, it is safe to use it. Blocks not marked as 
valid could actually be used blocks, and hence must not 
be used for allocation. The examination of inodes and 
blocks is carried out by two background threads called 
the metadata scanner and data scanner. The two threads 
work closely together in order to efficiently find all the 
used inodes and blocks on disk. 
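
The interaction between the allocation bitmap and the validity bitmap can be sketched as below. The class BlockAllocator and its method names are hypothetical; the sketch only shows the rule that a block is handed out when it has been both examined and found free.

```python
class BlockAllocator:
    """In-memory allocation state, rebuilt after every mount."""

    def __init__(self, nblocks):
        self.in_use = [False] * nblocks   # allocation bitmap, all "free" at mount
        self.valid = [False] * nblocks    # validity bitmap, nothing examined yet

    def mark_examined(self, block, used):
        """Called by a scanner once it has checked the block's backpointer."""
        self.in_use[block] = used
        self.valid[block] = True

    def allocate(self):
        """Hand out only blocks that are both examined and free."""
        for block, (used, ok) in enumerate(zip(self.in_use, self.valid)):
            if ok and not used:
                self.in_use[block] = True
                return block
        return None   # no verified free block yet
```

Right after mount, allocate() returns None even though every block is marked free, because no block has been examined yet.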


4.3.1 Metadata Scan 
Each inode needs to be examined in order to find out if it 
is in use or not. The backlinks in the inode are found, and 
the directory blocks of the referred inodes are searched 
for a directory entry to this inode. Note that the directory 
hierarchy is not used for the scan. The disk order of 
inodes is used instead, as this allows for fast sequential 
reads of the inode blocks. 

Once an inode is determined to be in use, its data blocks 
have to be verified. This information is communicated to the 
data scanner by adding the data blocks of the inode to a 
list of data blocks to be scanned. The inode information is 
also attached to the list so that the data scanner can sim- 
ply compare the backpointer value to the attached value 
to determine whether the block is used. However, if the 
inode has indirect blocks, the inode data blocks are ex- 
plored and verified immediately. An inode with indirect 
blocks may contain thousands of data blocks, and it would 
be cumbersome to add all those data blocks to the list and 
process them later; hence inode data is verified immedi- 
ately by the metadata scanner. Each inode is marked valid 
after it has been scanned, allowing inode allocation to oc- 
cur concurrently with the metadata scan. 


4.3.2 Data Scan 
Observe that a data block is in use only if it is pointed to by 
a valid inode which is in use; hence only data blocks that 
belong to a valid inode need to be checked, which reduces 
the number of blocks that need to be checked drastically. 
The data block scanner works off a list of data blocks 
that the metadata scanner provides. Each list item also 
includes information about the inode that contained the 
data block. Therefore, the data scanner simply needs to 
read the inode off the disk and compare the backpointer 
inode to the inode information in the list item. The data 
block is marked valid after the examination is complete. 
Since the data scanner only looks at blocks referred 
to by inodes, there may be plenty of unexamined blocks 
which are not referred and potentially free. These blocks 
cannot be marked as valid and free until the end of the data 
scan, when all valid inodes have been examined. While 
the scan is running, the file system may indicate that there 
are no free blocks available, even if there are many free 
blocks in the system. In order to fix this, we implemented 
another scanner called the sequential block scanner which 
reads data blocks in disk order and verifies them one by 
one. This thread is only started if no free blocks are found, 
and the data scanner is still running. 
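
The hand-off between the two scanners can be sketched as a producer and a consumer sharing a work list. The data shapes used here (dictionaries for inodes, directory entries, and backpointers) are assumptions made for illustration, and the sketch leaves out the indirect-block shortcut and the sequential block scanner.

```python
from collections import deque


def metadata_scan(inodes, dir_entries, work_list, inode_valid, inode_used):
    """Visit inodes in disk order; queue the blocks of every in-use inode.

    inodes:      {inode_no: {"backlinks": [...], "blocks": [...]}}
    dir_entries: {parent_inode_no: set of child inode numbers it lists}
    """
    for ino in sorted(inodes):
        inode = inodes[ino]
        # In use only if some parent named in a backlink really lists this inode.
        used = any(ino in dir_entries.get(parent, ()) for parent in inode["backlinks"])
        inode_used[ino] = used
        inode_valid[ino] = True
        if used:
            for blk in inode["blocks"]:
                work_list.append((blk, ino))


def data_scan(work_list, backpointers, block_valid, block_used):
    """Compare each queued block's backpointer with the inode that claims it."""
    while work_list:
        blk, ino = work_list.popleft()
        block_used[blk] = backpointers.get(blk) == ino
        block_valid[blk] = True
```

Blocks that no in-use inode refers to never reach the work list, so they stay unexamined; this is the gap that the sequential block scanner fills.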


4.4 Limitations 
The design of NoFS involves a number of trade-offs. We 
describe the limitations that arise from our design choices. 
Recovery: NoFS was designed to be as lightweight as 
possible, avoiding heavy machinery for logging or copy- 
on-write. As a result, file-system recovery is limited. For 
example, consider a file that is truncated, and later writ- 
ten with new data. After a crash in the middle of these 
updates, the file may point to a block that it does not 
own. This inconsistency is detected upon access to the 
data block. However, the version of the file which pointed 
to its old data cannot be recovered easily. By utilizing 
logging, a file system like ext3 provides the ability to pre- 
serve data in the event of a crash. 

Transactions: NoFS does not provide atomic transac- 
tions. Operations can be partially applied to different data 
structures. For example, if the file system crashes in the 
middle of a rename, it is possible that the file appears both 
in the old and new directories, as we do not validate direc- 
tory entries during a readdir. Though the user will be 
able to access the file via only one directory, the ‘old-or- 
new’ aspect of transactions is not provided. 

Accessing unverified objects: For large disks, it is 
possible that an object is accessed before the scan has 
verified it. Accessing such unverified objects involves a 
performance cost. The performance cost is felt during dif- 
ferent system calls for inodes and data blocks. 

Running the stat system call on an unverified inode 
may result in invalid information, as the number of blocks 
recorded in the inode may not match the actual number 
of blocks that belong to the inode on disk. In order to 
handle this, NoFS checks the inode status upon a stat 
call, and verifies the inode immediately if required, and 
then allows the system call to proceed. Since verification 
involves checking every data block referred to by the inode, 
the verification can take a lot of time. Running ls -l 
on a large directory of unverified files involves a large 
performance penalty arising from reading every file. For 
verified inodes, the stat will always return valid data, 
as the inode’s attributes are updated whenever an error is 
encountered on block access. Note that NoFS does not 
check directory entries for correctness. 
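
A minimal sketch of on-demand verification during stat follows. It assumes that verification drops any block whose backpointer does not name the inode and then reports the corrected block count; the dictionary-based inode and the function name stat are illustrative only.

```python
def stat(inode, verified_inodes, backpointers):
    """Return basic attributes, verifying the inode first if the scan has not."""
    if inode["number"] not in verified_inodes:
        # Verification touches every block the inode refers to.
        inode["blocks"] = [b for b in inode["blocks"]
                           if backpointers.get(b) == inode["number"]]
        verified_inodes.add(inode["number"])
    return {"st_ino": inode["number"], "st_blocks": len(inode["blocks"])}
```

The cost of the first call grows with the number of blocks in the file, while later calls on the same inode skip verification entirely.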

In the case of an unverified data block, no additional I/O 
is incurred during reads and partial writes since both in- 
volve reading the block off the disk anyway. However, in 
the case of a block overwrite, the block has to be read first 
to verify that it belongs to the inode before overwriting it. 
As a result, a write in ext2 is converted into a read-modify-write 
in NoFS, effectively cutting throughput in half. It 
should be noted that this happens only on the first over- 
write of each unverified block. After the first overwrite, 
the block has been verified, and hence the backpointer no 
longer needs to be checked. 
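
The extra read on the first overwrite can be sketched with a toy disk that counts I/Os. The Disk class, the verified set, and overwrite_block are hypothetical names; each stored block is modeled as an (owner inode, payload) pair.

```python
class Disk:
    """Toy disk that counts I/Os; blocks are (backpointer_inode, payload) pairs."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0
        self.writes = 0

    def read(self, n):
        self.reads += 1
        return self.blocks[n]

    def write(self, n, owner, payload):
        self.writes += 1
        self.blocks[n] = (owner, payload)


def overwrite_block(disk, verified, n, inode_no, payload):
    """Full-block overwrite; an unverified block needs one extra read first."""
    if n not in verified:
        owner, _ = disk.read(n)   # the additional I/O paid once per block
        if owner != inode_no:
            raise IOError("block %d does not belong to inode %d" % (n, inode_no))
        verified.add(n)
    disk.write(n, inode_no, payload)
```

Overwriting the same block twice costs one read and two writes in this model: only the first overwrite pays for verification.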

Thus it can be seen that accessing unverified objects 
involves a large performance hit. However, these costs 
are only incurred during the window between file-system 
mount and scan completion. 


5 Evaluation 


We now evaluate NoFS in two categories: reliability and 
performance. For reliability testing, we artificially prevent 
writes to certain sectors from reaching the disk, and then 
observe how NoFS handles the resulting inconsistency. 





[Table 3 body: 20 rows, one per dropped-write scenario (mkdir x6, link x2, 
unlink x2, rename x2, write x5, delete-create, truncate-write, unlink-link), 
with columns for the system call, the blocks dropped, the resulting error, 
and the behavior of ext2 and NoFS. The individual cells could not be 
recovered from the extracted text.] 

General Key 
C     Child                 inode   File inode 
P     Parent                dir     Directory block 
O     Old file/parent       data    Data block 
N     New file/parent       ind     Indirect block 

Key for Error                       Key for Action 
BD    Bad dir entry                 R     Block/inode reclaimed on scan 
OB    Orphan block                  EI    Error on inode access 
OI    Orphan inode                  ED    Error on data access 
HL    Wrong hard link count         EN    Error on access via new path 
GD    Garbage data                  EO    Error on access via old path 
TP    2 inodes refer to 1 block     EB    Error on block access 


Table 3: Reliability testing. The table shows how NoFS 
reacts to various inconsistencies that occur due to updates not 
reaching the disk. The behavior of ext2 is also shown. NoFS 
detects all inconsistencies and reports an error, while ext2 lets 
most of the errors pass by undetected. 


For performance testing, we evaluate the performance 
of NoFS on a number of micro and macro-benchmarks. 
We compare the performance of NoFS to ext2, an order- 
less file system with no consistency, and ext3 (in ordered 
mode), a journaling file system with metadata consistency. 


5.1 Reliability 


We test whether NoFS can handle inconsistencies caused 
by a file-system crash. When a crash happens, any sub- 
set of updates involved in a file-system operation could 
be lost. We emulate different system-crash scenarios by 
artificially restricting blocks from reaching the disk, and 
restarting the file-system module. The restarted module 
will see the results of a partially completed update on disk. 

We use a pseudo-device driver to prevent writes on tar- 
get blocks and inodes from reaching the disk drive. We 
interpose the pseudo-device driver between the file system 
and the physical device driver, and all writes to the 
disk drive go via the pseudo-device driver. The file system 
and the device driver communicate through a list of 
sectors. In the file system, we calculate the on-disk sectors 
of target blocks and inodes and add them to the blacklist 
of sectors. All writes to these sectors are ignored by 
the device driver. Thus, we are able to target inodes and 
blocks in a fine-grained manner. 
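
The fault-injection idea can be illustrated with a user-space stand-in for the pseudo-device driver. This is only a sketch: the class DroppingDevice and its methods are hypothetical, and the backing store is a plain dictionary rather than a real block device.

```python
class DroppingDevice:
    """Sits between file system and disk; silently drops blacklisted writes."""

    def __init__(self, backing):
        self.backing = backing     # dict: sector number -> bytes
        self.blacklist = set()

    def block_sectors(self, sectors):
        """Register sectors whose future writes must not reach the disk."""
        self.blacklist.update(sectors)

    def write(self, sector, data):
        if sector in self.blacklist:
            return                 # report success, but the data is never stored
        self.backing[sector] = data

    def read(self, sector):
        return self.backing.get(sector)
```

After a simulated restart, the file system sees the old contents of every blacklisted sector next to the new contents of all others, which is the partially completed update the test needs.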

Table 3 lists the behavior of ext2 and NoFS when 20 
different inconsistencies are caused by dropping some of 
the blocks involved in each file-system operation. For ex- 
ample, consider the mkdir operation. It involves adding a 
directory entry to the parent directory, updating the new 
child inode, and creating a new directory block for the 
child inode. We do not consider updates to the access 
time of the parent inode. In the reliability test, we would 
drop writes to different combinations of these blocks, and 
observe the actions taken by the file system. For instance, 
if the write to the new child inode is dropped, it creates 
a bad directory entry in the parent directory, and orphans 
the directory block of the new child inode. We observe 
whether the file system detects this corrupt directory en- 
try, and whether the orphan block is reclaimed. Both these 
actions are performed successfully in NoFS, whereas ext2 
allows the user to access a garbage inode, and the block 
remains an orphan until the next file-system check. 

The table entries which have two system calls denote 
the second system call happening after the first system 
call. These particular combinations were selected because 
they share a common resource. For example, truncate- 
write explores the case when a data block is deleted from 
a file and reassigned to another file. If the write to the 
truncated file inode fails, both files now point to the same 
data block, leading to an inconsistency. Similarly unlink- 
link and delete-create may share the same inode. 

Some inconsistencies, like a corrupt directory block, 
are detected by ext2. Many other inconsistencies, such as 
reading garbage data, are not detected by ext2. All incon- 
sistencies are detected by NoFS, and an error is returned 
to the user. When blocks and inodes are orphaned due to a 
crash, they are reclaimed by NoFS when the file system is 
scanned for allocation information upon reboot. Some of 
the inconsistencies could lead to potential security holes: 
for example, linking a sensitive file for temporary access, 
and removing the link later. If the directory block is not 
written to disk, the file could still be accessed, providing 
a way to read sensitive information. These security holes 
are detected upon access in NoFS, and any operation on 
them leads to an error. 


5.2 Performance 

To evaluate the performance of NoFS, we run a series of 
micro-benchmark and macro-benchmark workloads. We 
also observe the performance of NoFS at mount time, 
when the scan threads are still active. We show that NoFS 
has comparable performance to ext2 in most workloads, 
and that the performance of NoFS is reasonable when the 
scan threads are running in the background. We also measure 
the scan running time when the file system is populated 
with data, the rate at which NoFS scans data blocks 
to find free space, and the performance cost incurred when 
the stat system call is run on unverified inodes. 

Figure 4: Micro-benchmark performance. This 
figure compares file-system performance on various micro- 
benchmarks. The sequential benchmarks involve reading and 
writing a 1 GB file. The random benchmarks involve 10K random 
reads and writes in units of 4088 bytes (4096 bytes - 8 byte 
backpointer) across a 1 GB file, with an fsync after 1000 writes. 
The creation and deletion benchmarks involve 100K files spread 
over 100 directories, with an fsync after every create or delete. 

Our experiments were performed on a machine with an 
AMD 1 GHz Opteron processor and 1 GB of memory, running 
Linux 2.6.27.55. The disk drive used in the experiment 
was a Seagate Barracuda 160 GB, which provides 75 
MB/s read throughput and 70 MB/s write throughput. All 
experiments were performed on a cold file-system cache. 
The experiments were stable and repeatable. The numbers 
reported are the average over 10 runs. 


5.2.1 Micro-benchmarks 

We run a number of micro-benchmarks, focusing on dif- 
ferent operations like sequential write and random read. 
Figure 4 illustrates the performance of NoFS on these 
workloads. We observe that NoFS has minimal overhead 
on the read and write workloads. For the sequential write 
workload, the performance of ext3 is worse than ext2 and 
NoFS due to the journal writes that ext3 performs. 

The creation and deletion workloads involve doing a 
large number of creates/deletes of small files followed by 
fsync. This workload clearly brings out the performance 
penalty due to ordering points. The throughput of NoFS 
is twice that of ext3 on the file creation micro-benchmark, 
and 70% higher than ext3 on the file deletion benchmark. 

Figure 5: Macro-benchmark performance. The figure shows the throughput achieved on various application workloads. 
The sort benchmark is run on 500 MB of data. The varmail benchmark was run with parameters 1000 files, 100K mean dir width, 
16K mean file size, 16 threads, 16K I/O size and 16K mean append size. The file and webserver benchmarks were run with the 
parameters 1000 files, 20 dir width, 1 MB I/O size and 16K mean append size. The mean file size was 128K for the fileserver 
benchmark and 16K for the webserver benchmark. Fileserver benchmark used 50 threads while webserver used 100 threads. 


5.2.2 Macro-benchmarks 

We run the sort and Filebench [8] macro-benchmarks to 
assess the performance of NoFS on application work- 
loads. Figure 5 illustrates the performance of the three 
file systems on these macro-benchmarks. We selected the 
sort benchmark because it is CPU intensive. It sorts a 
500 MB file generated by the gensort tool [22], using the 
command-line sort utility. The performance of NoFS is 
similar to that of ext2 and ext3, demonstrating that NoFS 
has minimal CPU overhead. 

We run three workloads on Filebench: fileserver, web- 
server, and varmail. The fileserver workload emulates 
file-server activity, performing a sequence of creates, 
deletes, appends, reads, and writes. The webserver work- 
load emulates a multi-threaded web host server, perform- 
ing sequences of open-read-close on multiple files plus a 
log file append, with 100 threads. The varmail workload 
emulates a multi-threaded mail server, performing a se- 
quence of create-append-sync, read-append-sync, reads, 
and deletes in a single directory. 


We believe these benchmarks are representative of the 
different kinds of I/O workloads performed on file systems. 
The performance of NoFS matches ext2 and ext3 on 
all three workloads. NoFS outperforms ext3 by 18% on 
the varmail benchmark, demonstrating the performance 
degradation in ext3 due to ordering points. 


5.2.3 Scan performance 

We evaluate the performance of NoFS at mount time, 
when the scanner is still scanning the disk for free re- 
sources. The scanner is configured to run every 60 sec- 
onds, and each run lasts approximately 16 seconds. In or- 
der to understand the performance impact due to scanning, 
we do two experiments involving 10 sequential writes of 
200 MB each. The writes are spaced 30 seconds apart. 

In the first experiment, we start the writes at mount 
time. The scanning of the disk and the sequential writes 
are interleaved at 0s, 60s, 120s, and so on, leading to the 
write bandwidth dropping to half. When the sequential 
writes are run at 30s, 90s, 150s, and so on, the writes 


achieve peak bandwidth. In the second experiment, the 
writes were once again spaced 30s apart, but were started 
at 20s, after the end of the first scan run. In this experi- 
ment, the writes are never interleaved with the scan reads, 
and hence suffer no performance degradation. Graph (a) 
in Figure 6 illustrates these results. 
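
The timing of the two experiments can be checked with a few lines of arithmetic. The sketch below uses the figures from the text (a scan run every 60 seconds lasting roughly 16 seconds, and 10 writes spaced 30 seconds apart) and makes one simplifying assumption: a write is slowed only if it starts while a scan run is in progress, so the duration of the write itself is ignored.

```python
SCAN_PERIOD = 60   # seconds between the starts of scan runs (from the text)
SCAN_LENGTH = 16   # approximate duration of one scan run (from the text)


def overlaps_scan(start):
    """True if a write starting at `start` seconds begins during a scan run."""
    return start % SCAN_PERIOD < SCAN_LENGTH


def write_starts(first, count=10, spacing=30):
    """Start times of `count` writes, `spacing` seconds apart."""
    return [first + i * spacing for i in range(count)]


slowed_from_0 = [t for t in write_starts(0) if overlaps_scan(t)]
slowed_from_20 = [t for t in write_starts(20) if overlaps_scan(t)]
print(slowed_from_0)    # [0, 60, 120, 180, 240]
print(slowed_from_20)   # []
```

Starting at 0s, every second write lands inside a scan run; starting at 20s, the writes fall at 20s and 50s within each 60-second period and never meet the scan.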


Once the scan finishes, writes will once again achieve 
peak bandwidth. Running the scan without a break 
causes the scan to finish in around 90 seconds on an empty 
file system. Of course, one can configure this trade-off as 
need be; the larger the interval between scans, the smaller 
the performance impact during this phase, but the longer 
it takes to fully discover the free blocks of the system. 


Graph (b) in Figure 6 depicts the time taken to finish 
the scan (both metadata and data) when the file system 
is increasingly populated with data. In this experiment, 
the scan is run without a break upon file-system mount. 
All the data in the file system are in units of 1 MB files. 
The running time of the scan increases slowly when the 
amount of data in the file system is increased, reaching 
about 140s for 1 GB of data. We also performed an exper- 
iment where we created a variable number of empty files 
in the file systems and measured the time for the scan to 
run. We found that the time taken to finish the scan re- 
mained the same irrespective of the number of empty files 
in the system. Since every inode in the system is read and 
verified, irrespective of whether it is actively used in the 
file system or not, the scan time remains constant. 


During a file write, if there are no free blocks, the se- 
quential block scanner is invoked in order to scan data 
blocks and find free space. The write will block until free 
space is found. Graph (c) illustrates the performance of 
the sequential block scanner. The latency to scan 100 MB 
is around 3 seconds, and 1 GB of data is scanned in around 
30 seconds. The throughput is currently around 30 MB/s, 
so there is opportunity for optimizing its performance. 


As mentioned in Section 4.4, when stat is run on an 
unverified inode, NoFS first verifies the inode by check- 
ing all its data blocks. We ran an experiment to estimate 
the cost of such verification. We created four identical di- 
rectories, each filled with a number of 1 MB files. Every 
140 seconds, ls -l was run on one directory, leading 


FAST '12: 10th USENIX Conference on File and Storage Technologies 




[Figure 6 panels] 
(a) Effect of background scan on write bandwidth over time. X axis: Time (s). Series: writes every 30s, start at 0s; writes every 30s, start at 20s. 
(b) Effect of file-system data on scan running time. X axis: Total data (MB), 1 to 1024. Y axis: Running time. 
(c) Time taken to scan data blocks. X axis: Total data scanned (MB). Y axis: Running time (s). 
(d) Performance cost of stat on unverified inodes. X axis: Time (s). Y axis: ls time (s). Series: 128 MB, 256 MB, 512 MB; scan completion is marked. 


Figure 6: Scan performance. Figure (a) depicts the reduction in write bandwidth when sequential writes interleave with the 
background scan. Figure (b) shows that the running time of the scan increases slowly with the amount of data in the file system. Figure 
(c) illustrates the rate at which data blocks are scanned. Figure (d) demonstrates the performance cost incurred when the stat 
system call is run on unverified inodes. 


to a stat on each inode in the directory. The background 
scan started at file-system mount and finished at approximately 
250 seconds. We varied the number of files from 
128 to 512 and measured the time taken for ls -l in 
each experiment. Graph (d) illustrates the results. As expected, 
the time taken for ls to complete increases with 
the total data in the directory. After the scan completion 
at 250 seconds, all the inodes are verified, and hence ls 
finishes almost instantly. 


6 Discussion 


We have demonstrated that NoFS has better performance 
than journaling file systems such as ext3, while provid- 
ing better consistency guarantees. However, it should be 
noted that NoFS differs from ext3 in two important as- 
pects. First, it does not provide atomic transactions. Sec- 
ond, NoFS has no redundancy anywhere in the system. 
Part of the reason ext3 performs worse than NoFS is its ex- 
tra log writes. By writing transaction updates to a log first, 
ext3 provides both metadata consistency, and the ability 
to preserve old data if the transaction fails before commit. 
NoFS only provides the former. 

Given its current design, we feel an excellent use-case 
for NoFS would be as the local file system of a distributed 
file system such as the Google File System [11] or the 
Hadoop File System [37]. In such a distributed file sys- 
tem, reliable detection of corruption is all that is required, 
since redundant copies of data would be stored across the 
system. If the master controller is notified that a particular 
block has been corrupted in the local file system of a par- 
ticular node, it can make additional copies of the data in 
order to counter the corruption of the block. Furthermore, 
such distributed file systems typically have large chunk 
sizes. As shown in Section 5, NoFS provides very good 
performance on large sequential reads and writes, and is 
well suited for such workloads. 


It should be noted that backpointer-based consistency 
could also be used to help ensure integrity in a conven- 
tional file system against bugs or data corruption. The 
simplicity and low overhead of backpointers make such 
an addition to an existing file system feasible. 


By eliminating ordering, backpointer-based consis- 
tency allows the file system to maintain consistency with- 
out depending upon lower-layer primitives such as the 
disk cache flush. Previous research has shown that SATA 
drives do not always obey the flush command [29, 36], 
which is essential for file systems to implement ordering. 
IDE drives have also been known to disobey flush commands 
[28, 39]. Using backpointer-based consistency allows 
a file system to run on top of such misbehaving disks 
and yet maintain consistency. 

Potential users of NoFS should note two things. One, 
any application which requires strict ordering among file 
creates and writes should not use NoFS. Two, if there are 
corrupt files in the system, NoFS will only detect them 
upon access and not upon file-system mount. Some users 
may prefer to find out about corruption at mount time 
rather than when the file system is running. Such a use 
case aligns better with a file system such as ext3. 


7 Related Work 


The idea of using information inside or near the block to 
detect errors is not new. Cambridge File Server [7] used 
certain bits in each cylinder (cylinder map) to store the 
allocation status of blocks in that cylinder. Cedar File 
System [12] used ‘labels’ inside pages to check their al- 
location status. Embedding logical identity of blocks (in- 
ode number + offset) has been done in RAID to recover 
from lost and misdirected writes [16]. Transactional flash 
[27] embeds commit records inside every page to pro- 
vide transactions and recovery. However, NoFS is the first 
work that we know of that clearly defines the level of con- 
sistency that such information provides and uses such in- 
formation alone to provide consistency. 

The design of the Pilot file system [30] is very simi- 
lar to that of NoFS. Pilot employs self identifying pages 
and uses a scavenger to reconstruct the file system meta- 
data upon crash. However, like the file-system check, the 
scavenger needs to finish running before the file system 
can be accessed. In NoFS, the file system is made avail- 
able upon mount, and can be accessed while the scan is 
running in the background. 

Pangaea [33] uses backpointers for consistency in a dis- 
tributed wide area file system. However, its use of back- 
pointers is limited to directory entry backpointers that are 
used to resolve conflicting updates on directories. Simi- 
lar to NoFS, Pangaea also uses the backpointer as the true 
source of information, letting the backpointers of child in- 
odes dictate whether they belong to a directory or not. 

btrfs [48] supports back references that allow it to ob- 
tain the list of the extents that refer to a particular ex- 
tent. Although back references are conceptually similar 
to NoFS backpointers, the main purpose of btrfs back ref- 
erences is supporting efficient data migration, rather than 
providing consistency. Other mechanisms such as check- 
sums are used to ensure that the data is not corrupt in btrfs. 
Another key difference is that btrfs does not always store 
the back reference inside the allocated extent: sometimes 
the back references are stored as separate items close to 
the extent allocation records. 

Backlog [17] also uses explicit back references in order
to manage migration of data in write-anywhere file
systems. The back references in Backlog are stored in a
separate database, and are designed for efficient querying
of usage information rather than consistency. Backlog’s
back references are not used for incremental file-system
checking or resolving ownership disputes.

While NoFS makes an order-less file system more 
available by eliminating the need for the file-system 
check, there have been other approaches to increasing 
availability such as doing the file-system check while the 
system is online. McKusick’s background fsck [19] could 
repair simple inconsistencies such as lost resources by 
running fsck on snapshots of a running system. Chunkfs 
[14] is similar to our work, providing incremental, online 
file-system checking. Chunkfs differs from NoFS in that 
the minimal unit of checking is a chunk whereas it is a sin- 
gle file or block in NoFS. Chunkfs does not offer online 
repair of the file system, while it is possible in NoFS, due 
to backpointers and non-persistent allocation structures. 


8 Conclusion


Every modern file system uses ordering points to ensure 
consistency. However, ordering points have many disad- 
vantages including lower performance, higher complexity 
in file-system code, and dependence on lower layers of the 
storage stack to enforce ordering of writes. 

In this paper, we demonstrate that it is possible to build 
an order-less file system, NoFS, that provides consistency 
without sacrificing simplicity, availability or performance. 
NoFS allows immediate data access upon mounting, with- 
out file-system checks. We show that NoFS has excellent 
performance on many workloads, outperforming ext3 on 
workloads that frequently flush data to disk explicitly. 

Although potentially useful for the desktop, we believe 
NoFS may be of special significance in cloud computing 
platforms, where many virtual machines are multiplexed 
onto a physical device. In such cases, the underlying host 
operating system may try to batch writes together for per- 
formance, potentially ignoring ordering requests from vir- 
tual machines. NoFS allows virtual machines to maintain 
consistency without depending on the numerous lower 
layers of software and hardware. Removing such trust is 
key to building more robust and reliable storage systems. 
Abstract


In NAND flash memory, once a page program or block
erase (P/E) command is issued to a NAND flash chip,
subsequent read requests have to wait until the
time-consuming P/E operation completes. Preliminary
results show that the lengthy P/E operations may increase
the read latency by 2x on average. As NAND flash-based
SSDs enter enterprise server storage, this increased read
latency caused by the contention may significantly
degrade overall system performance. Inspired by the
internal mechanism of NAND flash P/E algorithms, we
propose in this paper a low-overhead P/E suspension
scheme, which suspends the on-going P/E to service
pending reads and resumes the suspended P/E afterwards.
In our experiments, we simulate a realistic SSD model
that adopts a multi-chip/channel architecture and
evaluate both SLC and MLC NAND flash as storage media
of diverse performance. Our experimental results show
that the proposed technique achieves a near-optimal
performance gain on servicing read requests.
Specifically, the read latency is reduced on average by
50.5% compared to RPS and 75.4% compared to FIFO, at a
cost of less than 4% overhead on write requests.


1 Introduction 


NAND flash-based SSDs have better random access
performance than hard drives and have potential in the
high-performance computing market. However, NAND flash
has performance and cost problems which limit its
application [11]. The problem addressed in this paper
is the read vs. program/erase (P/E) contention. Due to
the slow P/E speed of NAND flash, once a P/E is committed
to the flash chip, pending or subsequent read requests
suffer from prolonged service latency caused by the
waiting time. As disk read requests result from
upper-level cache misses, the compromised read latency
of the disk degrades application performance. To
reduce read latency, on-disk write buffers may avoid or
postpone the write commitments to the flash [9, 6, 7].
Executing the garbage collection processes during the
idle time of the drive may also alleviate the contention
between read and P/E [1, 10]. Furthermore, read
requests can be prioritized in a pending list to reduce
the queuing time caused by the P/E. However, none of
these approaches preempts a committed P/E for read
requests.

To address this read vs. P/E contention problem, we 
propose a P/E Suspension scheme for NAND flash that 
allows the execution of the P/E operations to be sus- 
pended so as to service the pending reads and then the 
suspended P/E is resumed. The internal process of the 
program operation is done in a “step-by-step” fashion 
(Incremental Step Pulse Programming, or ISPP [2]), and 
thus the program can be suspended at the interval of two 
consecutive steps, or the on-going step could be canceled 
and re-executed upon resumption. The erase process re- 
quires the duration of erase-voltage pulse to be satisfied, 
and thus the erase can also be suspended and resumed as 
long as we ensure the required timing. 

The implementation of P/E suspension for NAND
flash involves minimal modifications to the flash
interface, i.e., merely the “program suspend/resume” and
“erase suspend/resume” commands need to be added to
the command set of the flash interface [12]. To support
P/E suspension, the control logic inside the flash chip is
required to determine the appropriate time to suspend
the P/E (suspension point) and to maintain or retrieve
the previous state of the suspended P/E so as to resume
it. Note that the implementation feasibility of the
proposed schemes is based on the fundamental/typical
circuitry of flash memories [3].

This paper makes the following contributions. First, 
we analyze the impact of the long P/E latency on read 
performance, showing that even with the read prioritiza- 
tion scheduling, the read latency is still severely compro- 
mised. Second, by exploiting the internal mechanism of 
the P/E algorithms in NAND flash memory, we propose 
a low-overhead P/E suspension scheme which suspends 
the on-going P/E operations for servicing the pending 
read requests. In particular, two strategies for suspending
the program operation, Inter Phase Suspension (IPS)
and Intra Phase Cancelation (IPC), are proposed. Third,
based on simulation experiments under various work- 
loads, we demonstrate that compared to FIFO, the pro- 
posed design can significantly reduce the SSD read la- 
tency for both SLC and MLC NAND flash. 
The rest of this paper is organized as follows: In Section 2
we give an overview of the internal mechanism for
P/E on NAND flash and briefly discuss related work. In 
Section 3, we conduct simulations to show how the read 
latency is increased by chip contention. We describe our 
detailed P/E suspension scheme in Section 4 and evalu- 
ate our approach via simulation experiments in Section 5. 
Finally we conclude our paper in Section 6. 


2 Background and Related Work 


2.1 NAND Flash Program/Erase Algorithm 


Incremental Step Pulse Programming (ISPP) is typically 
used for precisely programming or erasing the NAND 
flash [3]. It is made of a series of program and verify
iterations. The execution of ISPP and the erase process 
is implemented in the flash chip with an analog block 
and a control logic block. The analog block is responsi- 
ble for regulating and pumping the voltage for program 
or erase operations. The control logic block is responsi- 
ble for interpreting the interface commands, generating 
the control signals for the flash cell array and the analog 
block, and executing the program and erase algorithms. 
As shown in the following diagram [3], the write state 
machine consists of three components: an algorithm con- 
troller to execute the algorithms for the two types of op- 
erations, several counters to keep track of the number of 
ISPP iterations, and a status register to record the results 
from the verify operation. 


[Diagram: Write State Machine, comprising the Counters, the Algorithm Controller, and the Status Register. Inputs come from the Command Interface and from the Flash Array; outputs go to the Analog Block and Flash Array.]
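The program-and-verify loop described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `WriteStateMachine`, the field names, the iteration limit, and the voltage values are all hypothetical and are not taken from [3] or from any real flash chip. It shows how a counter and a status register suffice to drive the iterations.

```python
from dataclasses import dataclass


@dataclass
class WriteStateMachine:
    """Toy control logic: an iteration counter plus a status register."""
    max_iterations: int          # predefined iteration limit (assumed value)
    iteration_counter: int = 0   # number of ISPP iterations executed so far
    status_ok: bool = False      # result of the most recent verify operation

    def program_cell(self, v_start, v_target, v_step):
        """Apply program pulses until verify passes or the limit is reached."""
        v_th = v_start
        self.iteration_counter = 0
        # A cell that is already at its target needs no iteration at all.
        self.status_ok = v_th >= v_target
        while not self.status_ok and self.iteration_counter < self.max_iterations:
            v_th += v_step                      # program phase: one voltage pulse
            self.iteration_counter += 1         # counter tracks the iteration
            self.status_ok = v_th >= v_target   # verify phase updates the status
        return v_th


# Hypothetical voltages (arbitrary units), chosen to be exact in binary floats.
wsm = WriteStateMachine(max_iterations=16)
final_v = wsm.program_cell(v_start=0.0, v_target=2.0, v_step=0.5)
print(wsm.iteration_counter, wsm.status_ok, final_v)  # 4 True 2.0
```

A smaller step would tighten the final threshold voltage at the cost of more iterations, which is the trade-off that makes the program operation long and therefore worth suspending.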


2.2 Related Work 














The idea of preempting low-priority operations for
high-priority ones by breaking down an operation into small
phases has been embodied in [4], [13], etc. Dimitrijevic
et al. proposed Semi-preemptible IO [4], which divides HDD
I/O requests into small disk commands to enable preemption
for high-priority requests. Similar to NAND flash,
Phase Change Memory (PCM) has much larger write latency
than read latency. Qureshi et al. proposed in [13]
a few techniques to preempt the on-going writes of PCM
for reads: write cancelation, which cancels entire write
operations, together with a threshold-based method to
control the resulting overhead. PCM, like NAND flash,
adopts an iterative-write algorithm. Our work differs
from [13] as follows. PCM has the in-place update
capability, while NAND flash requires erase before program;
in our work, the suspension of the erase operation is
proposed. Write cancelation for the entire write process of
NAND flash is not viable. NAND flash’s iterative write
process differs from that of PCM in that each iteration has
two phases (program and verify). Thus, for each iteration,
we may have two suspension points. Furthermore, we
propose the shadow buffer to overcome the overhead of
re-transferring the write data upon resumption, which is
not discussed in [13].


3 Motivation 


In this section, we demonstrate how the read vs. P/E
contention increases the read latency under various
workloads. We have modified the MS-add-on simulator [1] based
on Disksim 4.0. Specifically, under the workloads of a
variety of popular disk traces, we compare the read
latency of two scheduling policies, FIFO and read priority
scheduling (RPS), to show the limitation of RPS.
Furthermore, with RPS, we set the latency of the program and
erase operations to be equal to that of read, and then to
zero, to quantify the impact of P/E on the read latency.
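To make the limitation of RPS concrete, the following minimal Python sketch models a single flash chip serving requests without preemption. It is not the Disksim-based simulator used in this paper; the function name `average_read_latency`, the three-request workload, and the service times are hypothetical values chosen only for illustration.

```python
def average_read_latency(requests, service_us, read_priority):
    """Single chip, non-preemptive service; returns the mean read latency.

    requests: list of (arrival_us, kind) with kind in {"read", "program"}.
    service_us: dict mapping kind to its service time in microseconds.
    """
    pending = sorted(requests)
    queue, latencies = [], []
    now, i = 0.0, 0
    while i < len(pending) or queue:
        # Admit every request that has arrived by the current time.
        while i < len(pending) and pending[i][0] <= now:
            queue.append(pending[i])
            i += 1
        if not queue:                      # chip idle: jump to the next arrival
            now = float(pending[i][0])
            continue
        pick = 0                           # FIFO: oldest request first
        if read_priority:                  # RPS: oldest read first, if any
            pick = next((k for k, (_, kind) in enumerate(queue)
                         if kind == "read"), 0)
        arrival, kind = queue.pop(pick)
        now += service_us[kind]            # a committed operation is never preempted
        if kind == "read":
            latencies.append(now - arrival)
    return sum(latencies) / len(latencies)


workload = [(0, "program"), (10, "program"), (20, "read")]
slow_pe = {"read": 25, "program": 200}     # hypothetical latencies
zero_pe = {"read": 25, "program": 0}       # P/E latency set to zero

fifo = average_read_latency(workload, slow_pe, read_priority=False)
rps = average_read_latency(workload, slow_pe, read_priority=True)
pe0 = average_read_latency(workload, zero_pe, read_priority=True)
print(fifo, rps, pe0)  # 405.0 205.0 25.0
```

In this toy workload RPS lets the read bypass the queued program, but the read still waits for the program that was already committed to the chip; only removing the P/E latency closes the remaining gap.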


3.1 Configurations and Workloads 


The simulated SSD is configured as follows: there are 16
flash chips, each of which owns a dedicated channel to
the flash controller. Each chip has four planes that are
organized in a RAID-0 fashion; the size of one plane is
512 MB or 1 GB assuming the flash is used as SLC or 2-bit
MLC, respectively (the page size is 2 KB for SLC or
4 KB for MLC). To maximize the concurrency, each
individual plane has its own allocation pool [1]. The garbage
collection processes are executed in the background so
as to minimize the interference with the foreground
requests. In addition, the percentage of flash space
over-provisioning is set to 30%, which doubles the value
suggested in [1]. Considering the limited working-set size of
the workloads used in this paper (described below),
30% over-provisioning is believed to be sufficient
to avoid frequent execution of garbage collection
processes. The write buffer size is 64 MB. The SSD is
connected to the host via a PCI-E link of 2.0 GB/s. The
physical operating parameters of the flash memory are
summarized in Table 1.

We choose 6 disk I/O traces for our experiments:
Financial 1 and 2 (F1, F2) [14]; Display Ads Platform
and payload servers (DAP) and MSN storage metadata
(MSN) traces [8]; and Cello99 [5] traces (C3, C8). Note
that those traces were originally collected on HDDs;
to produce more stressful workloads for SSDs, we
compress all these traces so that the system idle time is
reduced from 98% to around 70% for each workload.
Table 1: Flash Parameters


Trace | SLC Read | SLC Write | MLC Read | MLC Write
F1    | 0.37     | 0.87      | 0.44     | 1.58
F2    | 0.24     | 0.57      | 0.27     | 1.03
DAP   | 1.92     | 6.85      | 5.74     | 11.74
MSN   | 4.13     | 4.58      | 8.47     | 25.21
C3    | 0.25     | 2.85      | 0.52     | 6.30
C8    | 0.44     | 2.33      | 0.56     | 4.54

Table 2: Numerical Latency Values of FIFO (in ms)


3.2 Experimental Results 


In this subsection, we compare the read latency
performance under four scenarios: FIFO; RPS; PER (the
latency of program and erase is set equal to that of read);
and PE0 (the latency of program and erase is set to zero).
Note that both PER and PE0 are applied on top of RPS in
order to study the chip contention and the limitation of
RPS. Due to the large range of the numerical values of
the experimental results, we normalize them to the
corresponding results of FIFO, which are listed in Table 2 for
reference. The normalized results are plotted in Fig. 1,
where the left part shows the results of SLC and the right
part is for MLC. Compared to FIFO, RPS achieves an
impressive performance gain: the effective read latency
(“effective” refers to the actual latency taking the queuing
delay into account) is reduced by 44.6% (SLC) and 48.3%
(MLC) on average. However, if the latency of P/E is the
same as the read latency or zero, i.e., in the case of PER
and PE0, the effective read latency can be further reduced.
For example, with PE0, the read latency reduction is 71.7%
(SLC) and 75.6% (MLC) on average. Thus, even with the RPS
policy, the chip contention still increases the read latency
by about 2x on average.
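The "about 2x" figure follows directly from the two average reductions quoted above. A short Python check, assuming only that both reductions are measured against the same FIFO baseline (the helper name `residual_ratio` is ours):

```python
def residual_ratio(rps_reduction, pe0_reduction):
    """Ratio of the read latency left under RPS to that left under PE0.

    Both arguments are fractional reductions relative to FIFO.
    """
    return (1.0 - rps_reduction) / (1.0 - pe0_reduction)


slc = residual_ratio(0.446, 0.717)   # average reductions quoted for SLC
mlc = residual_ratio(0.483, 0.756)   # average reductions quoted for MLC
print(f"SLC: {slc:.2f}x, MLC: {mlc:.2f}x")  # SLC: 1.96x, MLC: 2.12x
```

In other words, the read latency that remains under RPS is roughly twice what it would be if program and erase took no time at all.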
Figure 1: Read Latency Performance Comparison:
FIFO, RPS, PER, and PE0. Results normalized to FIFO.


4 Design of P/E Suspension Scheme 


4.1 Erase Suspension and Resumption 


In NAND flash, the erase process consists of two phases:
first, an erase pulse lasting for T_erase is applied on the
target block; second, a verify operation that takes T_verify
is performed to check if the preceding erase pulse has
successfully erased all bits in the block. If not, the
above process is repeated until success, or if the number
of iterations reaches the predefined limit, an operation
failure is reported. Typically, for NAND flash, since
over-erasure is not a concern [3], the erase operation can
be done with a single erase pulse.

How to suspend an erase operation: suspending either
the erase pulse or the verify operation requires resetting
the status of the corresponding wires that connect the
flash cells with the analog block. Specifically, due to the
fact that the flash memory works at different voltage biases
for different operations, the current voltage bias applied
on the wires (and thus on the cell) needs to be reset for
the pending read request. This process (Op_voltage_reset for
short) takes a period of T_voltage_reset. Note that both
the erase pulse and the verify operation always have to
conduct Op_voltage_reset at the end (as shown in the
following diagram of the erase operation timeline).


[Diagram: erase operation timeline. The erase pulse (T_erase) is followed by the verify operation (T_verify); each consists of an immediate suspension range followed by T_voltage_reset.]
Thus, if the suspension command arrives during
Op_voltage_reset, the suspension will succeed once
Op_voltage_reset is finished (as illustrated in the following
diagram of the erase suspension timeline).
Otherwise, an Op_voltage_reset is executed immediately
and then the read request is serviced by the chip (as
illustrated in the following diagram).


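[Diagram: erase suspension timeline. Op_voltage_reset (T_voltage_reset) is executed immediately after the suspension command, and the read is then serviced.]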
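The two suspension cases can be summarized as a small Python function. This is a sketch under the assumption that the last T_voltage_reset of each phase is the reset window; the function name and the numeric values in the usage lines are hypothetical.

```python
def suspension_delay_us(offset_us, phase_us, t_voltage_reset_us):
    """Delay between a suspend command and the chip being ready for the read.

    offset_us: time elapsed in the current phase (erase pulse or verify)
    when the suspend command arrives.
    phase_us: total duration of that phase, including its final reset.
    """
    if not 0 <= offset_us <= phase_us:
        raise ValueError("offset must lie within the phase")
    reset_start = phase_us - t_voltage_reset_us
    if offset_us >= reset_start:
        # The reset is already in progress: wait for it to finish.
        return phase_us - offset_us
    # Immediate suspension range: start the reset right away.
    return t_voltage_reset_us


# Hypothetical timings: a 1500 us erase pulse with a 4 us reset window.
mid_pulse = suspension_delay_us(700, 1500, 4)     # inside the immediate range
during_reset = suspension_delay_us(1498, 1500, 4)  # reset already running
print(mid_pulse, during_reset)  # 4 2
```

In both cases the read waits at most T_voltage_reset, which is why the scheme adds little to the read latency regardless of how long the erase pulse is.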








How to resume an erase operation: resumption means
that the control logic of the NAND flash resumes
the suspended erase operation. Therefore, the control
logic should keep track of the progress, i.e., whether the
suspension happened during the verify phase or the erase
pulse. In the first scenario, the verify operation has to
be redone from the beginning. In the second scenario, the
remaining erase pulse time (T_erase minus the progress),
for example, 1 ms, will be completed upon resumption if no
further suspension happens. In fact, the task of progress
tracking can be easily supported by the existing facilities
in the control logic of NAND flash: the pulse width
generator is implemented using counter-like logic [3], which
keeps track of the progress of the current pulse.

The overhead on the effective erase latency: resuming
the erase pulse requires extra time to set the wires
to the corresponding voltage bias, which takes
approximately the same amount of time as T_voltage_reset.
Suspending during the verify phase causes a re-do upon
resumption, and thus the overhead is the time of the
suspended/canceled verify operation. In addition, the read
service time is included in the effective erase latency.


4.2 Program Suspension and Resumption


The process of servicing a program request is as follows:
first, the data to be written is transferred through the
controller-chip bus and loaded into the page buffer; then
the ISPP is executed, in which a total of N_w_cycle
iterations, each consisting of a program phase followed by
a verify phase, are conducted on the target flash page. In
each ISPP iteration, the program phase is responsible for
applying the required program voltage bias on the cells so
as to charge them. In the verify phase, the content of the
cells is read to verify whether the desired amount of charge
is stored in each cell: if so, the cell is considered
program-complete; otherwise, one more ISPP iteration will
be conducted on the cell. Due to the fact that all cells in
the target flash page are programmed simultaneously, the
overall time taken to program the page is actually
determined by the cell that needs the largest number of ISPP
iterations. A major factor that determines the number of
ISPP iterations needed is the amount of charge to be stored
in the cell, which is in turn determined by the data to be
written. For example, for 2-bit MLC flash, programming a
“0” in a cell needs the largest number of ISPP iterations,
while for “3” (the erased state), no ISPP iteration is
needed. Since all flash cells in the page are programmed
simultaneously, N_w_cycle is determined by the smallest
data (2-bit) to be written; nonetheless, we make a
reasonable assumption in our simulation experiments that
N_w_cycle is constant and equal to the maximum value. The
program process is illustrated in the following diagram.

[Diagram: program process timeline. A bus transfer is followed by repeated ISPP iterations, each of duration T_w_cycle and consisting of a program phase (T_w_program) and a verify phase (T_verify).]
How to retain the page buffer content: before we
move on to suspension, this critical problem has to be
solved. For a program, the page buffer contains the data
to be written. For a read, it contains the retrieved data to
be transferred to the flash controller. If a write is
preempted by a read, the content of the page buffer is
inevitably replaced. Thus, resuming the write requires the
page buffer to be restored. Intuitively, the flash
controller that is responsible for issuing the suspension
and resumption commands may keep a copy of the write
page data until the program is finished, and upon
resumption, the controller re-sends the data to the chip
through the controller-chip bus. However, the page transfer
consumes a significant amount of time: unlike NOR flash,
which does byte programming, NAND flash does page
programming, and the page size is a few kilobytes.
For instance, assuming a 100 MHz bus and a 4 KB page
size, the bus time T_bus is about 40 µs.
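The 40 µs figure can be reproduced with a one-line calculation. The sketch below assumes an 8-bit bus that moves one byte per clock cycle; the bus width is our assumption, since the text only states the clock rate and the page size.

```python
def bus_transfer_time_us(page_bytes, bus_mhz, bus_width_bytes=1):
    """Time to move one page over the controller-chip bus, in microseconds.

    Assumes one transfer of bus_width_bytes per clock cycle (an assumption).
    """
    cycles = page_bytes / bus_width_bytes
    return cycles / bus_mhz   # cycles divided by cycles-per-microsecond


t_bus = bus_transfer_time_us(page_bytes=4 * 1024, bus_mhz=100)
print(f"{t_bus:.2f} us")  # 40.96 us
```

Under the same assumption a 2 KB page would take about half as long, so the cost of re-sending the data grows with the page size.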

To overcome this overhead, we propose a Shadow
Buffer in the flash. The shadow buffer serves as a
replica of the page buffer: it automatically loads
itself with the content of the page buffer upon the arrival
of the write request and restores the page buffer upon
resumption. The load and store operation takes the time
T_buffer. The shadow buffer has a parallel connection with
the page buffer, and thus the data transfer between them
can be done on the fly. T_buffer is normally smaller than
T_bus by one order of magnitude.

How to suspend a program operation: compared to
the long width of the erase pulse (T_erase), the program and
verify phases of the program process are normally two
orders of magnitude shorter. Intuitively, the program
process can be suspended at the end of the program phase
of any ISPP iteration as well as at the end of the verify
phase. We refer to this strategy as “Inter Phase
Suspension” (IPS). IPS has in total N_w_cycle × 2 potential
suspension points, as illustrated in the following diagram.
Due to the fact that at the end of the program or
verify phase, the status of the wires has already been
reset (Op_voltage_reset), IPS does not introduce any extra
overhead, except for the service time of the read or
reads that preempt the program. However, the effective
read latency should include the time from the arrival
of the read to the end of the corresponding phase.
For simplicity, assuming the arrival time of reads follows
the uniform distribution, the probability of encountering
the program phase and the verify phase is
T_w_program / (T_verify + T_w_program) and
T_verify / (T_verify + T_w_program), respectively. Thus, the
average extra latency for the read can be calculated as:

$$T_{read\_extra} = \frac{T_{w\_program}}{T_{verify} + T_{w\_program}} \cdot \frac{T_{w\_program}}{2} + \frac{T_{verify}}{T_{verify} + T_{w\_program}} \cdot \frac{T_{verify}}{2} \qquad (1)$$

Substituting the numerical values in Table 1, we get
8.29 µs (SLC) and 11.09 µs (MLC) for T_read_extra,
which is comparable to the physical access time of the
read (T_r_phy). To further improve the effective read
latency, we propose “Intra Phase Cancelation” (IPC).
Similar to canceling the verify phase for the erase
suspension, IPC cancels an on-going program or verify phase
upon suspension. The reason for canceling instead of
pausing the program phase is that the duration of the
program phase, T_w_program, is short and normally considered
atomic (cancelable but not pause-able).

Again, for IPC, if the read arrives when the program or
verify phase is conducting Op_voltage_reset, the suspension
actually happens at the end of the phase, which is the
same as IPS; otherwise, Op_voltage_reset is started
immediately and the read is then serviced. Thus, IPC achieves
a T_read_extra no larger than T_voltage_reset.
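The average extra read latency of IPS in Equation (1) can be evaluated with a short Python sketch. It assumes that a read arriving uniformly within a phase waits, on average, half of that phase. The phase durations below are assumptions chosen for illustration (the Table 1 values are not reproduced in this copy); with these inputs the formula yields the 8.29 µs and 11.09 µs quoted in the text.

```python
def ips_extra_read_latency(t_w_program, t_verify):
    """Expected wait until the end of the current phase under IPS.

    A read lands in a phase with probability proportional to the phase
    length and then waits, on average, half of that phase.
    """
    cycle = t_w_program + t_verify
    wait_in_program = (t_w_program / cycle) * (t_w_program / 2)
    wait_in_verify = (t_verify / cycle) * (t_verify / 2)
    return wait_in_program + wait_in_verify


# Assumed (T_w_program, T_verify) pairs in microseconds, for illustration only.
assumed_phases = {"SLC": (20.0, 8.0), "MLC": (20.0, 24.0)}
extra = {name: ips_extra_read_latency(*phases)
         for name, phases in assumed_phases.items()}
print({name: round(value, 2) for name, value in extra.items()})
# {'SLC': 8.29, 'MLC': 11.09}
```

For IPC the corresponding wait is bounded by T_voltage_reset instead, since the on-going phase is canceled rather than allowed to finish.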

How to resume from IPS: first of all, the page buffer 
is re-loaded with the content of the shadow buffer. Then, 
the control logic examines the last ISPP iteration number 
and the previous phase. If IPS happens at the end of the 
verify phase, which implies that the information of the 
status of cells has already been obtained, we may con- 
tinue with the next ISPP if needed; on the other hand, if 
the last phase is the program phase, naturally we need to 
finish the verify operation before moving on to the next 
ISPP iteration. The resumption process is illustrated in 
the following diagram. 


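[Diagram: resumption from IPS. After the read is serviced, the page buffer is re-loaded from the shadow buffer (T_buffer) at the resumption point, and the program/verify iterations continue.]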





How to resume from IPC: compared to IPS, the
resumption from IPC is more complex. Different from the
verify operation, which does not change the charge status
of the cell, the program operation puts charge in the cell
and thus changes the threshold voltage (V_th) of the cell.
Therefore, we need to determine, by a verify operation,
whether the canceled program phase has already achieved
the desired V_th (i.e., whether the data could be
considered written in the cell). If so, no more ISPP iteration
is needed on this cell; otherwise, the previous program
operation is executed on the cell again. The latter case is
illustrated in the following diagram.





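[Diagram: resumption from IPC. The canceled program phase is followed by the read; at the resumption point the page buffer is re-loaded, then a verify, the re-done program phase, and a further verify are executed.]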


Re-doing the program operation would have some effect
on the tightness of V_th, but with the aid of ECC
and a fine-grained ISPP, i.e., a small incremental voltage
ΔV_pp, IPC has little impact on the data reliability of
the NAND flash. The relationship between ΔV_pp and the
tightness of V_th is modeled in [15].

The overhead on the effective write latency: IPS
requires re-loading the page buffer, which takes T_buffer.
For IPC, if the verify phase is canceled, the overhead is
the elapsed time of the canceled verify phase plus the
read service time and T_buffer. If the program phase is
canceled, there are two scenarios: if the verify operation
reports that the desired V_th is achieved, the overhead is
the read service time plus T_buffer; otherwise, the overhead
is the elapsed time of the canceled program phase plus an
extra verify phase, in addition to the overhead of the above
scenario. Clearly, IPS achieves smaller overhead on the
write than IPC but relatively lower read performance.
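The overhead cases above can be collected into one Python function. This is a sketch of our reading of the text, not the simulator's code: the function name, the parameter names, and the timing values in the usage lines are hypothetical, and the IPS case is assumed to include the read service time alongside T_buffer.

```python
def write_overhead_us(strategy, t_read_service, t_buffer, t_verify=0.0,
                      canceled_phase=None, elapsed=0.0, vth_reached=False):
    """Extra write latency caused by one suspension, in microseconds."""
    base = t_read_service + t_buffer       # read service plus buffer re-load
    if strategy == "IPS":
        return base                        # suspended at a phase boundary
    if strategy != "IPC":
        raise ValueError("strategy must be 'IPS' or 'IPC'")
    if canceled_phase == "verify":
        return elapsed + base              # the canceled verify is redone
    if canceled_phase == "program":
        if vth_reached:
            return base                    # the canceled pulse was sufficient
        return elapsed + t_verify + base   # redo the pulse, plus an extra verify
    raise ValueError("canceled_phase must be 'verify' or 'program'")


# Hypothetical timings: 35 us read service, 3 us buffer re-load, 8 us verify.
common = dict(t_read_service=35.0, t_buffer=3.0, t_verify=8.0)
ips = write_overhead_us("IPS", **common)
ipc_verify = write_overhead_us("IPC", canceled_phase="verify", elapsed=5.0, **common)
ipc_done = write_overhead_us("IPC", canceled_phase="program", elapsed=12.0,
                             vth_reached=True, **common)
ipc_redo = write_overhead_us("IPC", canceled_phase="program", elapsed=12.0, **common)
print(ips, ipc_verify, ipc_done, ipc_redo)  # 38.0 43.0 38.0 58.0
```

The IPC overhead never falls below the IPS overhead in this model, which matches the trade-off stated above: IPC buys lower read latency at the cost of more work on the write side.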


5 Performance Evaluation 


In this section, we evaluate our proposed design under 
different workloads described in Section 3.1. 


5.1 Read Performance Gain 


First, we compare the average read latency of P/E
suspension with RPS, PER and PE0 in Fig. 2, where the
results are normalized to that of RPS. For P/E
suspension, IPC (Intra Phase Cancelation), denoted as
“PES_IPC”, is adopted in Fig. 2. PE0, with which the
physical latency values of program and erase are set to
zero, serves as an optimistic situation where the
contention between reads and P/Es is completely eliminated.
Fig. 2 demonstrates that, compared to RPS, the
proposed P/E suspension achieves a significant read
performance gain, which is almost equivalent to the optimal
case, PE0 (with less than 1% difference). Specifically,
averaged over the 6 traces, PES_IPC reduces the read
latency by 48.9% for SLC and 50.5% for MLC compared
to RPS, and by 71.6% for SLC and 75.4% for MLC compared
to FIFO. For conciseness, in the following text the results
for SLC and then MLC are listed without explicit
specification.
Figure 2: Read Latency Performance Comparison: RPS,
PER, PE0, and PES_IPC (P/E Suspension using IPC).
Normalized to RPS.


As stated in Section 4, IPC can achieve better read
performance but causes higher write overhead compared to
IPS. We compare the read performance of IPC and IPS
in Fig. 3. The read latency of IPS is 8.0% and 2.7% higher
than that of IPC on average, and at most 13.2% and 6.7%
higher (under F1). The difference results from the fact
that IPS has extra read latency, which is mostly the time
between read request arrivals and the suspension points
at the end of the program or verify phase. We notice that
the latency performance of IPS using SLC is poorer than
with MLC under all traces, which is because of the higher
sensitivity of SLC’s read latency to the overhead caused by
the extra latency.


[Plot: PES_IPC and PES_IPS bars in SLC and MLC panels for traces F1, F2, DAP, MSN, C3, and C8.]

Figure 3: Read Latency Performance Comparison:
PES_IPC vs. PES_IPS. Normalized to PES_IPC.


5.2 Write Overhead 


Since both RPS and P/E suspension introduce minimal
extra chip bandwidth usage, the write throughput
is barely compromised. We use the latency as a metric
for the overhead evaluation. First, we compare the average
write latency of FIFO, RPS, PES_IPS, and PES_IPC
in Fig. 4. The write overhead in terms of latency is
small compared to the read performance gain we
achieve with P/E suspension. Specifically, RPS increases
the write latency by 2.3% and 1.2% on average and at
most 6.7% (SLC, MSN) and 3.8% (MLC, DAP), compared
to FIFO. PES_IPS increases the write latency by 3.6%
and 1.9% on average and at most 6.9% (SLC, MSN) and
4.3% (MLC, DAP). PES_IPC increases the
write latency by 3.6% and 2.0% on average and at most
6.9% (SLC, MSN) and 4.3% (MLC, DAP).
Figure 4: Write Latency Performance Comparison:
FIFO, RPS, PES_IPC, PES_IPS. Normalized to FIFO.


Two major factors determine the write latency
overhead: the increased latency of each suspended P/E
operation, and the percentage of P/E operations that are
suspended. We compare the original P/E latency reported by
the device with the latency after suspension in Fig. 5. The
average overhead of a suspended P/E is about 10.2% (SLC)
and 7.8% (MLC). The percentage of suspended P/E operations
is presented in Fig. 6. On average, 4.9% (SLC) and 7.4%
(MLC) of P/E operations are suspended. These two sets of
results explain the low write overhead our design achieves.
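A first-order way to see how the two factors combine is to multiply them, as in the Python sketch below. This is a back-of-the-envelope estimate of the average increase in P/E service time across all P/E operations, assuming the two averages can simply be multiplied. It ignores queuing effects, so it is not expected to match the end-to-end write latency increase reported in Fig. 4.

```python
def mean_pe_service_increase(fraction_suspended, overhead_per_suspended):
    """Average relative increase in P/E service time over all P/E operations."""
    return fraction_suspended * overhead_per_suspended


slc = mean_pe_service_increase(0.049, 0.102)   # averages quoted for SLC
mlc = mean_pe_service_increase(0.074, 0.078)   # averages quoted for MLC
print(f"SLC: {slc:.2%}, MLC: {mlc:.2%}")  # SLC: 0.50%, MLC: 0.58%
```

Because only a few percent of P/E operations are suspended and each pays a modest penalty, the device-level service time grows by well under 1% in this estimate.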
Figure 5: Comparison of the original write latency with the
effective write latency resulting from P/E suspension. The Y
axis represents the percentage of increased latency caused
by P/E suspension.


5.3 Sensitivity Study on Write Queue Size


Finally, we study the sensitivity of write overhead to the
write queue size. In order to obtain an amplified write
overhead, we select F2, which has the highest percentage
of read requests, and compress the simulation time
of F2 by 7 times to intensify the workload. In Figure 7
we present the write latency results of RPS and
PES_IPC (normalized to that of FIFO), varying the
maximum write queue size from 16 to 512. Clearly, the
write overhead of both RPS and PES_IPC is sensitive
to the maximum write queue size, which suggests that
the flash controller should limit the write queue size to
control the write overhead. Note that, relative to RPS,
PES_IPC has a near-constant increase in the write
latency, which implies that the major contributor to the
overhead is RPS when the queue size varies.

Figure 6: Percentage of suspended writes.


[Plot: RPS and PES_IPC curves; x-axis: maximum write queue size; y-axis: normalized write latency.]

Figure 7: The write latency performance of RPS and
PES_IPC while the maximum write queue size varies.
Normalized to FIFO.


6 Conclusion and Future Work 


One performance problem of NAND flash is that its
program and erase latency is much higher than the read
latency. This problem causes chip contention between
reads and P/Es because, with the current NAND
flash interface, an on-going P/E cannot be suspended
and resumed. To alleviate the impact of the chip
contention on the read performance, in this paper we propose
a low-overhead P/E suspension scheme that exploits
the internal mechanism of the P/E algorithms in NAND flash.
The design is simulated and evaluated with precise timing
and realistic multi-chip/channel SSD modeling.
Experimental results show that the proposed P/E suspension
significantly reduces the read latency with trivial
overhead on write performance.

Our future work will apply the idea of P/E suspension 
to further improve the performance of foreground pro- 
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cesses via suspending the background operations (e.g., 
the garbage collection operations) in SSDs. 
Abstract


As NAND Flash technology continues to scale down and more bits
are stored in a cell, the raw reliability of NAND Flash
memories inevitably degrades. To meet the retention capability
required for a reliable storage system, we see a trend of
longer write latency and more complex ECCs employed in SSD
storage systems. These greatly impact the performance of
future SSDs. In this paper, we present the first work to
improve SSD performance via retention relaxation. NAND Flash
is typically required to retain data for 1 to 10 years
according to industrial standards. However, we observe that
much data is overwritten within hours or days in several
popular datacenter workloads. The gap between the
specification guarantee and actual programs’ needs can be
exploited to improve write speed or ECCs’ cost and
performance. To exploit this opportunity, we propose a system
design that allows data to be written with various latencies
or protected by different ECC codes without hampering
reliability. Simulation results show that via write speed
optimization, we can achieve 1.8x to 5.7x write response time
speedup. We also show that for future SSDs, retention
relaxation can bring both performance and cost benefits to the
ECC architecture.


1 Introduction 


For the past few years, NAND Flash memories have been widely
used in portable devices such as media players and mobile
phones. Due to their high density, low power, and high I/O
performance, NAND Flash memories have in recent years begun to
make the transition from portable devices to laptops, PCs, and
datacenters [6, 35]. As the semiconductor industry continues
scaling memory technology and lowering per-bit cost, NAND
Flash is expected to replace hard disk drives and
fundamentally change the storage hierarchy in future computer
systems [14, 16].

A reliable storage system needs to provide a retention
guarantee. Therefore, Flash memories have to meet the
retention specification in industrial standards. For example,
according to the JEDEC standard JESD47G.01 [19], NAND Flash
blocks cycled to 10% of the maximum specified endurance must
retain data for 10 years, and blocks cycled to 100% of the
maximum specified endurance have to retain data for 1 year. As
NAND Flash technology continues to scale down and more bits
are stored in a cell, the raw reliability of NAND Flash
decreases substantially. To meet the retention specification
for a reliable storage system, we see a trend of longer write
latency and more complex ECCs required in SSDs. For example,
comparing recent 2-bit MLC NAND Flash memories with previous
SLC ones, page write latency increased from 200 μs [34] to
1800 μs [39], and the required strength of ECCs went from
single-error-correcting Hamming codes [34] to
24-error-correcting Bose-Chaudhuri-Hocquenghem (BCH) codes
[8, 18, 28]. In the near future, more complex ECC codes such
as low-density parity-check (LDPC) codes [15] will be required
to reliably operate NAND Flash memories [13, 28, 41].

To overcome the design challenge for future SSDs, in this
paper we present retention relaxation, the first work on
optimizing SSDs via relaxing NAND Flash’s retention
capability. We observe that in typical datacenter workloads,
e.g., proxy and MapReduce, much of the data written into
storage is updated quite soon and therefore requires only days
or even hours of data retention, which is much shorter than
the retention time typically specified for NAND Flash. In this
paper, we exploit the gap between the specification guarantee
and actual programs’ needs for SSD optimization. We make the
following contributions:


• We propose a NAND Flash model that captures the relationship
  between raw bit error rates and retention time based on
  empirical measurement data. This model allows us to explore
  the interplay between retention capability and other NAND
  Flash parameters such as the program step voltage for write
  operations.


• A set of datacenter workloads are characterized for their
  retention time requirements. Since I/O traces are usually
  gathered in days or weeks, to analyze retention time
  requirements in a time span beyond the trace period, we
  present a retention time projection method based on two
  characteristics obtained from the traces, the write amount
  and the write working set size. Characterization results
  show that for 15 of the 16 traces analyzed, 49-99% of writes
  require less than 1-week retention time.


• We explore the benefits of retention relaxation for speeding
  up write operations. We increase the program step voltage so
  that NAND Flash memories are programmed faster but with
  shorter retention guarantees. Experimental results show that
  1.8x to 5.7x SSD write response time speedup is achievable.


• We show how retention relaxation can benefit ECC designs for
  future SSDs which require concatenated BCH-LDPC codes. We
  propose an ECC architecture where data are encoded by
  variable ECC codes based on their retention requirements. In
  our ECC architecture, time-consuming LDPC is removed from
  the critical performance path. Therefore, retention
  relaxation can bring both performance and cost benefits to
  the ECC architecture.


Figure 1: Incremental step pulse programming (ISPP) for programming NAND Flash


The rest of the paper is organized as follows. Sec- 
tion 2 provides background about NAND Flash. Sec- 
tion 3 presents our NAND Flash model and the benefits 
of retention relaxation. Section 4 analyzes data reten- 
tion requirements in real-world workloads. Section 5 de- 
scribes the proposed system designs. Section 6 presents 
evaluation results regarding the designs in the previous 
section. Section 7 describes related work, and Section 8 
concludes the paper. 


2 Background 


NAND Flash memories comprise an array of floating gate
transistors. The threshold voltage ($V_{th}$) of the
transistors can be programmed to different levels by injecting
different amounts of charge on the floating gates. Different
$V_{th}$ levels represent different data. For example, to
store $N$ bits of data in a cell, its $V_{th}$ is programmed
to one of $2^N$ different $V_{th}$ levels.

FAST '12: 10th USENIX Conference on File and Storage Technologies


To program $V_{th}$ to the desired level, the incremental step
pulse programming (ISPP) scheme is commonly used [26, 37]. As
shown in Figure 1, ISPP increases the $V_{th}$ of NAND Flash
cells step-by-step by a certain voltage increment (i.e.,
$\Delta V_P$) and stops once $V_{th}$ is greater than the
desired threshold voltage. Because NAND Flash cells have
different starting $V_{th}$, the resulting $V_{th}$ spreads
across a range, which determines the precision of cells’
$V_{th}$ distributions. The smaller $\Delta V_P$ is, the more
precise the resulting $V_{th}$ is. On the other hand, a
smaller $\Delta V_P$ means more steps are required to reach
the target $V_{th}$, resulting in longer write latency [26].
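
The trade-off described above can be illustrated with a minimal sketch. It assumes an idealized cell in which every pulse raises $V_{th}$ by exactly $\Delta V_P$; the function name `ispp_program` and all voltage values are hypothetical and chosen only for illustration.

```python
def ispp_program(v_start_mv, v_target_mv, delta_vp_mv):
    """Apply fixed-size program pulses until V_th exceeds the target.

    Idealized model (an assumption, not a device description): each pulse
    raises V_th by exactly delta_vp_mv. Voltages are integers in
    millivolts so the arithmetic is exact.
    Returns (final V_th in mV, number of pulses).
    """
    v, pulses = v_start_mv, 0
    while v <= v_target_mv:      # stop once V_th is greater than the target
        v += delta_vp_mv
        pulses += 1
    return v, pulses


if __name__ == "__main__":
    starts = [-3000, -2750, -2500]   # hypothetical starting V_th values (mV)
    target = 1000                    # hypothetical target threshold (mV)
    for delta in (100, 300):
        results = [ispp_program(s, target, delta) for s in starts]
        print("delta_vp =", delta, "->", results)
```

In this toy model the final $V_{th}$ always lands within $\Delta V_P$ above the target, so a larger step needs fewer pulses but leaves a wider spread.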


NAND Flash memories are prone to errors. That is, the $V_{th}$
level of a cell may be different from the intended one. The
fraction of bits which contain incorrect data is referred to
as the raw bit error rate (RBER). Figure 2(a) shows the
measured RBER of 63-72 nm 2-bit MLC NAND Flash memories under
room temperature following 10K program/erase (P/E) cycles
[27]. The RBER at retention time = 0 is attributed to write
errors. Write errors have been shown to be mostly caused by
cells with higher $V_{th}$ than intended because the causes of
write errors, such as program disturb and random telegraph
noise, tend to over-program $V_{th}$. The increase in RBER
after writing data (retention time > 0) is attributed to
retention errors. Retention errors are caused by charge losses
which decrease $V_{th}$. Therefore, retention errors are
dominated by cells with lower $V_{th}$ than intended.
Figure 2(b) illustrates these two error sources: write errors
mainly correspond to the tail at the high-$V_{th}$ side;
retention errors correspond to the tail at the low-$V_{th}$
side. In Section 3.1, we model NAND Flash considering these
error characteristics.


A common approach to handling NAND Flash errors is to adopt
ECCs (error correction codes). ECCs supplement user data with
redundant parity bits to form codewords. With ECC protection,
a codeword with up to a certain number of corrupted bits can
be reconstructed. Therefore, ECCs can greatly reduce the bit
error rate. We refer to the bit error rate after applying ECCs
as the uncorrectable bit error rate (UBER). The following
equation gives the relationship between UBER and RBER [27]:


USENIX Association 


Figure 2: Bit error rate in NAND Flash memories. (a) Measured
RBER(t) and fitting to power-law trends ($y = a + bx^m$, with
$m$ = 1.08, 1.25, and 1.33) for 63-72 nm 2-bit MLC NAND Flash
from three manufacturers; X axis: retention time (days); Y
axis: RBER. Data are aligned to $t = 0$. (b) Write errors and
retention errors, shown as the tails of $P(v)$ over $V_{th}$.
$$
UBER = \frac{\sum_{n=t+1}^{N_{CW}} \binom{N_{CW}}{n} \, RBER^{n} \, (1 - RBER)^{N_{CW}-n}}{N_{user}}
\tag{1}
$$

Here, $N_{CW}$ is the number of bits per codeword, $N_{user}$
is the number of user data bits per codeword, and $t$ is the
maximum number of error bits the ECC code can correct per
codeword.
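
As a rough numerical illustration of how UBER follows from RBER, the sketch below assumes the usual binomial-tail form: the probability that a codeword holds more than $t$ bit errors, divided by the number of user bits. The function name `uber` and the codeword layout (a 1080-byte codeword carrying 1024 bytes of user data) are assumptions made for this example, not values stated in the text.

```python
import math


def uber(rber, n_cw, n_user, t):
    """Probability of more than t bit errors in an n_cw-bit codeword,
    divided by the number of user bits per codeword.

    Terms are evaluated in log space so that the large binomial
    coefficients do not overflow. Requires 0 < rber < 1.
    """
    log_p = math.log(rber)
    log_q = math.log1p(-rber)
    tail = 0.0
    for n in range(t + 1, n_cw + 1):
        log_term = (math.lgamma(n_cw + 1) - math.lgamma(n + 1)
                    - math.lgamma(n_cw - n + 1)
                    + n * log_p + (n_cw - n) * log_q)
        tail += math.exp(log_term)   # underflows harmlessly to 0.0
    return tail / n_user


if __name__ == "__main__":
    # Assumed layout: 1080-byte codeword, 1024 bytes of user data,
    # 24 correctable bits per codeword.
    print(uber(4.5e-4, 1080 * 8, 1024 * 8, 24))
```

Under these assumptions, a 24-bit-correcting code at an RBER of $4.5 \times 10^{-4}$ yields a UBER on the order of $10^{-16}$.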

UBER is an important reliability metric for storage systems
and is typically required to be under $10^{-13}$ to $10^{-16}$
[27]. As mentioned earlier, NAND Flash’s RBER increases with
time due to retention errors. Therefore, to satisfy both the
retention and reliability specifications in storage systems,
ECCs must be strong enough to tolerate not only write errors
present in the beginning but also retention errors
accumulating over time.


3 Retention Relaxation for NAND Flash 


The key observation we make in this paper is that since
retention errors increase over time, if we could relax the
retention capability of NAND Flash memories, fewer retention
errors need to be tolerated. These error margins can then be
utilized to improve other performance metrics. In this
section, we first present a $V_{th}$ distribution modeling
methodology which captures the RBER of NAND Flash. Based on
the model, we elaborate on the strategies to exploit the
benefits of retention relaxation in detail.


3.1 Modeling Methodology 


We first present the base $V_{th}$ distribution model for NAND
Flash. Then we present how we extend the model to capture the
characteristics of different error causes. Last, we determine
the parameters of the model by fitting the model to the
error-rate behavior of NAND Flash.


3.1.1 Base $V_{th}$ Distribution Model


The $V_{th}$ distribution is critical to NAND Flash. It
describes the probability density function (PDF) of $V_{th}$
for each data state. Given a $V_{th}$ distribution, one can
evaluate the corresponding RBER by calculating the probability
that a cell contains incorrect $V_{th}$, i.e., $V_{th}$ higher
or lower than the intended level.

$V_{th}$ distributions have been modeled using bell-shape
functions in previous studies [23, 42]. For MLC NAND Flash
memories with $q$ states per cell, $q$ bell-shape functions,
$P_k(v)$ where $0 \le k \le (q-1)$, are employed in the model
as follows.

First, the $V_{th}$ distribution of the erased state is
modeled as a Gaussian function, $P_0(v)$:

$$
P_0(v) = \alpha_0 \, e^{-\frac{(v - \mu_0)^2}{2\sigma_0^2}}
\tag{2}
$$

Here, $\sigma_0$ is the standard deviation of the distribution
and $\mu_0$ is the mean. Because data are assumed to be in one
of the $q$ states with equal probability, a normalization
coefficient, $\alpha_0$, is employed so that
$\int_v P_0(v)\,dv = \frac{1}{q}$.

Furthermore, the $V_{th}$ distribution of each non-erased
state (i.e., $1 \le k \le (q-1)$) is modeled as a combination
of a uniform distribution with width equal to $\Delta V_P$ in
the middle and two identical Gaussian tails on both sides:

$$
P_k(v) =
\begin{cases}
\alpha \, e^{-\frac{(v - \mu_k + 0.5\Delta V_P)^2}{2\sigma^2}}, & v < \mu_k - \frac{\Delta V_P}{2} \\
\alpha \, e^{-\frac{(v - \mu_k - 0.5\Delta V_P)^2}{2\sigma^2}}, & v > \mu_k + \frac{\Delta V_P}{2} \\
\alpha, & \text{otherwise}
\end{cases}
\tag{3}
$$

Here, $\Delta V_P$ is the voltage increment in ISPP, $\mu_k$
is the mean of each state, $\sigma$ is the standard deviation
of the two Gaussian tails, and $\alpha$ is again the
normalization coefficient to satisfy the condition that
$\int_v P_k(v)\,dv = \frac{1}{q}$ for the $q - 1$ non-erased
states.
Given the $V_{th}$ distribution, the RBER can be evaluated by
calculating the probability that a cell contains incorrect
$V_{th}$, i.e., $V_{th}$ higher or lower than the intended
read voltage levels, using the following equation:

$$
RBER = \sum_{k=0}^{q-1} \Bigg(
\underbrace{\int_{-\infty}^{V_{R,k}} P_k(v)\,dv}_{V_{th}\ \text{lower than intended}}
+
\underbrace{\int_{V_{R,(k+1)}}^{\infty} P_k(v)\,dv}_{V_{th}\ \text{higher than intended}}
\Bigg)
\tag{4}
$$

Here, $V_{R,k}$ is the lower bound of the correct read voltage
for the $k$-th state and $V_{R,(k+1)}$ is the upper bound, as
shown in Figure 3.


Figure 3: Illustration of model parameters (panels: State 0, State 1, State 2, State 3)


3.1.2 Model Extension 


As mentioned in Section 2, the two tails of a $V_{th}$
distribution are from different causes. The high-$V_{th}$ tail
of a distribution is mainly caused by $V_{th}$
over-programming (i.e., write errors); the low-$V_{th}$ tail
is mainly due to charge losses over time (i.e., retention
errors) [27]. Therefore, the two Gaussian tails may not be
identical. To capture this difference, we extend the base
model by setting different standard deviations for the two
tails as shown in Figure 3.

The two standard deviations are set based on the observation
in a previous study on Flash’s retention process [7]. Under
room temperature¹, a small portion of cells have a much larger
charge-loss rate than others. As such charge losses accumulate
over time, the distribution tends to form a wider tail at the
low-$V_{th}$ side. Therefore, we extend the base model by
setting the standard deviation of the low-$V_{th}$ tail to be
a time-increasing function, $\sigma_{low}(t)$, but keeping
$\sigma_{high}$ time-independent. The extended model is as
follows:

$$
P_k(v,t) =
\begin{cases}
\alpha(t) \, e^{-\frac{(v - \mu_k + 0.5\Delta V_P)^2}{2\sigma_{low}(t)^2}}, & v < \mu_k - \frac{\Delta V_P}{2} \\
\alpha(t) \, e^{-\frac{(v - \mu_k - 0.5\Delta V_P)^2}{2\sigma_{high}^2}}, & v > \mu_k + \frac{\Delta V_P}{2} \\
\alpha(t), & \text{otherwise}
\end{cases}
\tag{5}
$$

Here, the normalization term becomes a function of time,
$\alpha(t)$, to keep $\int_v P_k(v,t)\,dv = \frac{1}{q}$.


¹ According to the previous study [32], in datacenters, HDDs’
average temperatures range between 18 and 51°C and stay around
26 to 30°C most of the time. Since SSDs do not contain motors
and actuators, we expect SSDs to run at lower temperatures
than HDDs. Therefore, we only consider room temperature in our
current model.
We should note that keeping $\sigma_{high}$ time-independent
does not imply that cells with high $V_{th}$ are
time-independent and never leak charge. Since the integral of
the PDF for each data state remains $\frac{1}{q}$, the
probability that a cell belongs to the high-$V_{th}$ tail
drops as the low-$V_{th}$ tail widens over time. The same
phenomenon happens to the middle part, too.

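Given the $V_{th}$ distribution in the extended model,
RBER(t) can be evaluated using the following formula:

$$
RBER(t) = \sum_{k=0}^{q-1} \Bigg(
\underbrace{\int_{-\infty}^{V_{R,k}} P_k(v,t)\,dv}_{V_{th}\ \text{lower than intended}}
+
\underbrace{\int_{V_{R,(k+1)}}^{\infty} P_k(v,t)\,dv}_{V_{th}\ \text{higher than intended}}
\Bigg)
\tag{6}
$$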
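
To make the extended model concrete, the sketch below evaluates the two error integrals for a single programmed state in closed form. It is illustrative only: the normalization $\alpha = 1 / \big(q\,(\Delta V_P + \sqrt{2\pi}\,(\sigma_{low} + \sigma_{high})/2)\big)$ is derived here from the requirement that each state integrates to $1/q$, the function names are hypothetical, and every numeric parameter is a made-up value rather than a fitted one.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)


def alpha(q, delta_vp, sigma_low, sigma_high):
    """Normalization so that one state's PDF integrates to 1/q.

    Each half-Gaussian tail contributes alpha * sigma * sqrt(2*pi) / 2 and
    the flat middle contributes alpha * delta_vp.
    """
    return 1.0 / (q * (delta_vp + SQRT_2PI * (sigma_low + sigma_high) / 2.0))


def state_error_probs(q, mu, delta_vp, sigma_low, sigma_high, vr_lo, vr_hi):
    """Return (P[V_th < vr_lo], P[V_th > vr_hi]) for one programmed state.

    Assumes both read levels lie in the Gaussian tails, i.e.
    vr_lo <= mu - delta_vp/2 and vr_hi >= mu + delta_vp/2.
    """
    if vr_lo > mu - delta_vp / 2 or vr_hi < mu + delta_vp / 2:
        raise ValueError("read levels must lie outside the flat region")
    a = alpha(q, delta_vp, sigma_low, sigma_high)
    z_lo = (vr_lo - (mu - delta_vp / 2)) / sigma_low    # <= 0
    z_hi = (vr_hi - (mu + delta_vp / 2)) / sigma_high   # >= 0
    # Gaussian tail areas written with the complementary error function.
    p_low = a * sigma_low * SQRT_2PI * 0.5 * math.erfc(-z_lo / math.sqrt(2.0))
    p_high = a * sigma_high * SQRT_2PI * 0.5 * math.erfc(z_hi / math.sqrt(2.0))
    return p_low, p_high


if __name__ == "__main__":
    # Hypothetical parameters (volts); sigma_low grows with retention time.
    for sigma_low in (0.05, 0.08):
        print(sigma_low,
              state_error_probs(4, 2.0, 0.2, sigma_low, 0.05, 1.6, 2.4))
```

With these made-up numbers, widening the low-side tail raises the low-side error probability while slightly lowering the high-side one, because $\alpha(t)$ shrinks as $\sigma_{low}(t)$ grows.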


3.1.3 Model Parameter Fitting 


In the proposed $V_{th}$ distribution model, $\Delta V_P$,
$V_{R,k}$, $\mu_k$, and $\sigma_0$ are set to the values shown
in Figure 4 according to [9]. The two new parameters in the
extended model, $\sigma_{high}$ and $\sigma_{low}(t)$, are
determined through parameter fitting such that the resulting
RBER(t) follows the error-rate behavior of NAND Flash. Below
we describe the parameter fitting procedure.

We adopt the power-law model [20] to describe the error-rate
behavior of NAND Flash:

$$
RBER(t) = RBER_{write} + RBER_{retention} \times t^{m}
\tag{7}
$$

Here, $t$ is time, $m$ is a coefficient with $1 \le m \le 2$,
$RBER_{write}$ corresponds to the error rate at $t = 0$ (i.e.,
write errors), and $RBER_{retention}$ is the incremental error
rate per unit of time due to retention errors.

We determine m in the power-law model based on the 
curve-fitting values shown in Figure 2(a). In the figure, 
the power-law curves fit the empirical error-rate data very 
well with m equal to 1.08, 1.25, and 1.33. We consider 
1.25 as the typical case of m and consider the other two 
values as the corner cases. 

The other two coefficients in the power-law model,
$RBER_{write}$ and $RBER_{retention}$, can be solved given
RBER at $t = 0$ and RBER at the maximum retention time,
$t_{max}$. According to the JEDEC standard JESD47G.01 [19],
NAND Flash blocks cycled to the maximum specified endurance
have to retain data for 1 year, so we set $t_{max}$ to 1 year.
Moreover, recent NAND Flash requires 24-bit error correction
for 1080-byte data [4, 28]. Assuming that the target
$UBER(t_{max})$ requirement is $10^{-16}$, by Equation (1), we
have:

$$
RBER(t_{max}) = 4.5 \times 10^{-4}
\tag{8}
$$


As shown in Figure 2(a), $RBER(0)$ is typically orders of
magnitude lower than $RBER(t_{max})$. Tanakamaru et al. [41]
also show that write errors are between 150x and 450x fewer
than retention errors. This is because retention errors
accumulate over time and eventually dominate. Therefore, we
set $RBER_{write}$ accordingly:

$$
RBER_{write} = RBER(0) = \frac{RBER(t_{max})}{C_{write}}
\tag{9}
$$


Figure 4: Modeled 2-bit MLC NAND Flash


Figure 5: Modeling results


Here, $C_{write}$ is the ratio of $RBER(t_{max})$ to
$RBER(0)$. We set $C_{write}$ to 150, 300, and 450, where 300
is considered as the typical case and the other two are
considered as the corner cases.

We also have $RBER_{retention}$ as follows:

$$
RBER_{retention} = \frac{RBER(t_{max}) - RBER(0)}{t_{max}^{\,m}}
\tag{10}
$$


We note that among write errors (i.e., $RBER_{write}$), a
major fraction correspond to cells with higher $V_{th}$ than
intended. This is because the root causes of write errors tend
to make $V_{th}$ over-programmed. Mielke et al. [27] show that
this fraction is between 62% and about 100% for the NAND Flash
devices in their experiments. Therefore, we give the following
equations:

$$
RBER_{write\_high} = RBER_{write} \times C_{write\_high}
\tag{11}
$$

$$
RBER_{write\_low} = RBER_{write} \times (1 - C_{write\_high})
\tag{12}
$$

Here, $RBER_{write\_high}$ and $RBER_{write\_low}$ correspond
to cells with $V_{th}$ higher and lower than the intended
levels, respectively. $C_{write\_high}$ stands for the
fraction of all write errors in which $V_{th}$ is higher than
the intended level. We set $C_{write\_high}$ to 62%, 81%, and
99%, where 81% is considered as the typical case and the other
two are considered as the corner cases.
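
The coefficient derivation in Equations (7) to (12) can be checked numerically with a short sketch. The function names are hypothetical; the inputs are the typical-case values stated in the text ($RBER(t_{max}) = 4.5 \times 10^{-4}$, $C_{write} = 300$, $m = 1.25$, $C_{write\_high} = 81\%$, $t_{max} = 1$ year).

```python
def power_law_coefficients(rber_tmax, c_write, m, t_max=1.0):
    """Solve the two power-law coefficients from RBER(t_max) and C_write."""
    rber_write = rber_tmax / c_write                        # Eq. (9)
    rber_retention = (rber_tmax - rber_write) / t_max ** m  # Eq. (10)
    return rber_write, rber_retention


def rber(t, rber_write, rber_retention, m):
    """Power-law error-rate model, Eq. (7); t is in the same unit as t_max."""
    return rber_write + rber_retention * t ** m


def split_write_errors(rber_write, c_write_high):
    """Split write errors into high-V_th and low-V_th parts, Eqs. (11), (12)."""
    return rber_write * c_write_high, rber_write * (1.0 - c_write_high)


if __name__ == "__main__":
    w, r = power_law_coefficients(4.5e-4, 300.0, 1.25)
    print("RBER_write =", w, "RBER_retention =", r)
    print("RBER(0.5 year) =", rber(0.5, w, r, 1.25))
    print("high/low write errors =", split_write_errors(w, 0.81))
```

By construction, evaluating the model at $t = t_{max}$ returns $RBER(t_{max})$ again, which is a quick sanity check on the two coefficients.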

Now we have the error-rate behavior of NAND Flash.
$\sigma_{high}$ and $\sigma_{low}(0)$ are first determined so
that the error rates for $V_{th}$ being higher and lower than
intended equal $RBER_{write\_high}$ and $RBER_{write\_low}$,
respectively. Then, $\sigma_{low}(t)$ is determined by
matching the RBER(t) derived from the $V_{th}$ model with NAND
Flash’s error-rate behavior described in Equations (7) to (10)
at a fine time step.

Figure 5 shows the modeling results of the $V_{th}$
distribution for the typical-case NAND Flash. In this figure,
the solid line stands for the $V_{th}$ distribution at
$t = 0$; the dashed line stands for the $V_{th}$ distribution
at $t = 1$ year. We can see that the 1-year distribution is
flatter than the distribution at $t = 0$. We can also see that
as the low-$V_{th}$ tail widens over a year, the probability
of both the middle part and the high-$V_{th}$ tail drops
correspondingly.


3.2 Benefits of Retention Relaxation 


In this section, we elaborate on the benefits of retention
relaxation from two perspectives: improving write speed and
improving ECCs’ cost and performance. The analysis is based on
NAND Flash memories cycled to the maximum specified endurance
(i.e., 100% wear-out) with data retention capability set to
1 year [19]. Since NAND Flash’s reliability typically degrades
monotonically with P/E cycles, considering such an extreme
case is conservative for the following benefit evaluation. In
other words, NAND Flash in its early lifespan has more
headroom for optimization.


3.2.1 Improving Write Speed 


As presented earlier, NAND Flash memories use the ISPP scheme
to incrementally program memory cells. The $V_{th}$ step
increment, $\Delta V_P$, directly affects write speed and data
retention. Write speed is proportional to $\Delta V_P$ because
with larger $\Delta V_P$, fewer steps are required during the
ISPP procedure. On the other hand, data retention decreases as
$\Delta V_P$ gets larger because a large $\Delta V_P$ widens
$V_{th}$ distributions and reduces the margin for tolerating
retention errors.

Algorithm 1 shows the procedure to quantitatively evaluate the
write speedup if data retention time requirements are reduced.
The analysis is based on the extended NAND Flash model
presented in Section 3.1. For all the typical and corner cases
we consider, we first enlarge $\Delta V_P$ by various ratios
between 1x and 3x, thereby speeding up NAND Flash writes
proportionately. For each ratio, we test RBER(t) at different
retention times from 0 to 1 year to find the maximum $t$ such
that RBER(t) is within the base ECC strength.

Figure 6 shows the write speedup vs. data retention. Both the
typical case (black line) and the corner cases (gray dashed
lines) we consider in Section 3.1 are shown. For the typical
case, if data retention is relaxed to 10 weeks, a 1.86x
speedup for NAND Flash page write is achievable; if data
retention is relaxed to 2 weeks, the speedup is 2.33x.
Furthermore, the speedups for the corner cases are close to
the typical case. This means the speedup numbers are not very
sensitive to the values of the parameters we obtain using
parameter fitting.


Figure 6: NAND page write speedup vs. data retention


Algorithm 1 Write speedup vs. data retention


C_ECC = 4.5 × 10^-4
for all typical and corner cases do
    for VoltageRatio = 1 to 3 step = 0.01 do
        Enlarge ΔV_P by VoltageRatio times
        WriteSpeedUp = VoltageRatio
        for Time t = 0 to 1 year step = δ do
            Find RBER(t) according to σ_low(t) and α(t)
        end for
        DataRetention = max{t : RBER(t) ≤ C_ECC}
        plot (DataRetention, WriteSpeedUp)
    end for
end for


3.2.2 Improving ECCs’ Cost and Performance 


ECC design is emerging as a critical issue in SSDs. Nowadays,
NAND Flash-based systems heavily rely on BCH codes to tolerate
RBER. Unfortunately, BCH degrades memory storage efficiency
significantly once the RBER of NAND Flash reaches $10^{-3}$
[22]. Recent NAND Flash has RBER around $4.5 \times 10^{-4}$.
As the density of NAND Flash memories continues to increase,
RBER will inevitably exceed the BCH limitation. Therefore, BCH
codes will become inapplicable in the near future.


LDPC codes are promising ECCs for future NAND Flash memories
[13, 28, 41]. The main advantage of LDPC codes is that they
can provide correction performance very close to the
theoretical limits. However, LDPC incurs much higher encoding
complexity than BCH does [21, 25]. For example, an optimized
LDPC encoder [44] consumes 3.9 Mbits of memory and 11.4 k FPGA
logic elements to offer 45 MB/s throughput. To sustain the
write throughput of high-performance SSDs, e.g., 1 GB/s ones
[1], high-throughput LDPC encoders are required; otherwise the
LDPC encoders may become the throughput bottleneck. This leads
to high hardware cost because hardware parallelization is one
basic approach to increasing the throughput of LDPC encoders
[24]. In this paper, we exploit retention relaxation to
alleviate this cost and performance dilemma. That is, with
retention relaxation, fewer retention errors need to be
tolerated; therefore, BCH codes could still be strong enough
to protect data even if NAND Flash’s 1-year RBER soars.


Figure 7: Data retention capability of 24 error correction per
1080 bytes BCH codes

Algorithm 2 analyzes the achievable data retention time of BCH
codes with 24 bits per 1080 bytes error-correction capability
under different NAND Flash RBER(1 year) values. Here we assume
that RBER(t) follows the power-law trend described in
Section 3.1.3. We vary RBER(1 year) from $4.5 \times 10^{-4}$
to $1 \times 10^{-1}$, and derive the corresponding write
error rate ($RBER_{write}$) and retention error increment per
unit of time ($RBER_{retention}$). The achievable data
retention time of the BCH codes is the time when RBER exceeds
the capability of the BCH codes (i.e., $4.5 \times 10^{-4}$).


Algorithm 2 Data retention vs. maximum RBER for BCH (24-bit
per 1080 bytes)


t_max = 1 year
C_ECC = 4.5 × 10^-4
for all typical and corner cases do
    for RBER(t_max) = 4.5 × 10^-4 to 1 × 10^-1 step = δ do
        RBER_write = RBER(t_max) / C_write
        RBER_retention = (RBER(t_max) - RBER_write) / t_max^m
        RetentionTime = ((C_ECC - RBER_write) / RBER_retention)^(1/m)
        plot (RBER(t_max), RetentionTime)
    end for
end for
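
The inner computation of Algorithm 2 can be sketched directly from the power-law model. The function name and its defaults are hypothetical choices for this example; the default values are the typical case from Section 3.1.3 ($C_{write} = 300$, $m = 1.25$) with $C_{ECC} = 4.5 \times 10^{-4}$ and $t_{max} = 1$ year.

```python
WEEKS_PER_YEAR = 365.0 / 7.0


def bch_retention_years(rber_tmax, c_write=300.0, m=1.25,
                        c_ecc=4.5e-4, t_max=1.0):
    """Time (in years) at which the power-law RBER(t) reaches c_ecc.

    rber_tmax is the RBER at t_max; the defaults are the typical-case
    parameters. Returns 0.0 if the write errors alone already exceed
    what the ECC can tolerate.
    """
    rber_write = rber_tmax / c_write
    rber_retention = (rber_tmax - rber_write) / t_max ** m
    if rber_write >= c_ecc:
        return 0.0
    # Solve c_ecc = rber_write + rber_retention * t**m for t.
    return ((c_ecc - rber_write) / rber_retention) ** (1.0 / m)


if __name__ == "__main__":
    for rber_1yr in (4.5e-4, 3.5e-3, 2.2e-2):
        weeks = bch_retention_years(rber_1yr) * WEEKS_PER_YEAR
        print(f"RBER(1 year) = {rber_1yr:g}: about {weeks:.1f} weeks")
```

With the typical-case parameters, 1-year RBER values of $3.5 \times 10^{-3}$ and $2.2 \times 10^{-2}$ give retention times of roughly 10 weeks and 2 weeks, respectively.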


Figure 7 shows the achievable data retention time of the BCH
code given different RBER(1 year) values. The black line
stands for the typical case and the gray dashed lines stand
for the corner cases. As can be seen, for the typical case,
the baseline BCH code can retain data for 10 weeks even if
RBER(1 year) reaches $3.5 \times 10^{-3}$. Even if
RBER(1 year) reaches $2.2 \times 10^{-2}$, the baseline BCH
code can still retain data for 2 weeks. We can also see
similar trends for the corner cases.


Figure 8: Data retention requirements in a write stream


4 Data Retention Requirements Analysis 


In this section, we first analyze real-world traces from
enterprise datacenters to show that many writes into storage
require days or even shorter retention time. Since I/O traces
are usually gathered in days or weeks, to estimate the
percentage of writes with retention requirements beyond the
trace period, we present a retention-time projection method
based on two characteristics obtained from the traces, the
write amount and the write working set size.


4.1 Real Workload Analysis 


In this subsection, we analyze real disk traces to understand
the data retention requirements of real-world applications.
The data retention requirement of a sector written into a disk
is defined as the interval from the time the sector is written
to the time the sector is overwritten. Take Figure 8 as an
example. The disk is written by the address stream a, b, b, a,
c, a, and so on. The first write is to address a at time 0,
and the same address is overwritten at time 3; therefore, the
data retention requirement for the first write is (3 - 0) = 3.
Disk traces usually cover only a limited period of time, so
for writes whose next write does not appear before the
observation ends, the retention requirements cannot be
determined. For example, for the write to address b at time 2,
the overwrite time is unknown. We denote its retention
requirement with ‘?’ as a conservative estimate. It is
important to note that we are focusing on data retention
requirements for data blocks in write streams rather than for
those in the entire disk.
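
The definition above maps onto a single pass over the write stream. The sketch below is a minimal illustration, assuming one write per time unit as in the Figure 8 example; the function name is hypothetical, and `None` plays the role of ‘?’.

```python
def retention_requirements(addresses):
    """addresses[i] is the address written at time i.

    Returns a list whose i-th entry is the interval until the next write
    to the same address, or None when no overwrite is observed before the
    trace ends.
    """
    result = [None] * len(addresses)
    last_write = {}                      # address -> time of latest write
    for t, addr in enumerate(addresses):
        if addr in last_write:
            prev = last_write[addr]
            result[prev] = t - prev      # the earlier write is now overwritten
        last_write[addr] = t
    return result


if __name__ == "__main__":
    stream = ["a", "b", "b", "a", "c", "a"]
    print(retention_requirements(stream))
```

For the stream a, b, b, a, c, a the first write gets a requirement of 3, while the write to b at time 2 stays undetermined, matching the example in the text.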

Table 1 shows the three sets of traces we analyze. The first
is from an enterprise datacenter in Microsoft Research
Cambridge (MSRC) [29]. This set covers 36 volumes from various
servers, and we select the 12 with the largest write amounts.
These traces span 1 week and 7 hours. We skip the first
7 hours, which do not form a complete day, and use the
remaining 1-week part. The second set of traces is MapReduce,
which has been shown to benefit from the increased bandwidth
and reduced latency of NAND Flash-based SSDs [11]. We use
Hadoop [2] to run the MapReduce benchmark on a cluster of two
Core-i7 machines, each of which has 8 GB RAM and a SATA hard
disk and runs 64-bit Linux 2.6.35 with the ext4 filesystem. We
test two MapReduce usage models. In the first model, we
repeatedly replace 140 GB of text data in the Hadoop cluster
and invoke word counting jobs. In the second model, we
interleave word counting jobs on two sets of 140 GB text data
which have been pre-loaded in the cluster. The third workload
is the TPC-C benchmark. We use Hammerora [3] to generate the
TPC-C workload on a MySQL server which has a Core-i7 CPU,
12 GB RAM, and a SATA SSD and runs 64-bit Linux 2.6.32 with
the ext4 filesystem. We configure the benchmarks as having 40
and 80 warehouses. Each warehouse has 10 users with keying and
thinking time. Both MapReduce and TPC-C workloads span 1 day.


Table 1: Workload summary

Name              Description              Span
prn_0             Print server             1 week
proj_0, proj_2    Project directories      1 week
prxy_0, prxy_1    Web proxy                1 week
src1_0, src1_2    Source control           1 week
src2_2            Source control           1 week
usr_1, usr_2      User home directories    1 week
MapReduce         WordCount benchmark
tpcc2

For each trace, we analyze the data retention requirement of
every sector written into the disk. Figure 9 shows the
cumulative percentage of data retention requirements less than
or equal to the following values: a second, a minute, an hour,
a day, and a week. As can be seen, the data retention
requirements of the workloads are usually low. For example,
more than 95% of sectors written into the disk for proj_0,
prxy_1, tpcc1, and tpcc2 need less than 1-hour data retention.
Furthermore, for all the traces except proj_2, 49-99.2% of
sectors written into the disk need less than 1-week data
retention. For tpcc2, up to 44% of writes require less than
1-second retention. This is because MySQL’s storage engine,
InnoDB, writes data to a fixed-size log, called the
doublewrite buffer, before writing to the data file to guard
against partial page writes; therefore, all writes to the
doublewrite buffer are overwritten very quickly.





4.2 Retention Requirement Projection 


Figure 9: Data retention requirement distribution


The main challenge of retention time characterization for
real-world workloads is that I/O traces are usually gathered
over a short period of time, e.g., days or weeks. To estimate
the percentage of writes with retention requirements beyond
the trace period, we derive a projection method based on two
characteristics obtained from the traces, the write amount and
the write working set size. We denote the percentage of writes
with retention time requirements less than $X$ within a time
period of $Y$ as $S_{X,Y}\%$. We first formulate
$S_{T_1,T_1}\%$ in terms of the write amount and the write
working set size, where $T_1$ is the time span of the trace.
Let $N$ be the amount of data sectors written into the disk
during $T_1$ and $W$ be the write working set size (i.e., the
number of distinct sector addresses being written) during
$T_1$. We have the following formula (the proof is similar to
the pigeonhole principle):

$$
S_{T_1,T_1}\% = \frac{N - W}{N} = 1 - \frac{W}{N}
\tag{13}
$$

With this formula, the first projection we make is the
percentage of writes that have retention time requirements
less than $T_1$ in an observation period of $T_2$, where
$T_2 = k \times T_1$, $k \in \mathbb{N}$. The projection is
based on the assumption that for each $T_1$ period, the write
amount and the write working set size remain $N$ and $W$,
respectively. We derive the lower bound on $S_{T_1,T_2}\%$ as
follows:

$$
S_{T_1,T_2}\% = \frac{k(N - W) + \sum_{i=1}^{k-1} u_i}{kN} \ge S_{T_1,T_1}\%
\tag{14}
$$


where u; is the number of sectors whose lifetime is across 
two periods and their retention time requirements are less 
than 7;. Equation (14) implies that we do not overesti- 
mate the percentage of writes that have retention time 
requirements less than 7; by characterizing a trace gath- 
ered in a 7] period. 

With the first projection, we can then derive the lower
bound on $S_{T_2,T_2}\%$. Clearly, $S_{T_2,T_2}\% \ge S_{T_1,T_2}\%$. Combined
with Equation (14), we have:

$$S_{T_2,T_2}\% \ge S_{T_1,T_2}\% \ge S_{T_1,T_1}\% \tag{15}$$


The lower bound on $S_{T_2,T_2}\%$ also depends on the disk
capacity, A. During $T_2$, the write amount is equal to $k \times N$,
and the write working set size must be less than or equal
to the disk capacity, i.e., $k \times W \le A$. Together with
Equation (13), we have:

$$S_{T_2,T_2}\% \ge \frac{kN - A}{kN} = 1 - \frac{A}{kN} \tag{16}$$

Combining Equations (15) and (16), the lower bound
on $S_{T_2,T_2}\%$ is given by:

$$S_{T_2,T_2}\% \ge \max\left(1 - \frac{A}{kN},\; S_{T_1,T_1}\%\right) \tag{17}$$

Table 2 shows the data retention requirements analysis
using the above equations. First, we can see that
the $S_{T_1,T_1}\%$ obtained from Equation (13) matches Figure 9.
Let's take hd2 for example. There are a total
of 726 GB of writes in 1 day whose write working set
size is 313 GB. According to Equation (13), 57% of the
writes have retention time requirements of less than
1 day. This is the case shown in Figure 9. Furthermore,
if we can observe the hd2 workload for 1 week, more
than 86% of writes are expected to have retention time
requirements of less than 1 week. This again shows
the gap between the specification guarantee and actual
programs' needs in terms of data retention.
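The bounds in Equations (13) and (17) are simple enough to check numerically. The following is a minimal sketch, not part of the original analysis; the function names are illustrative, and the disk capacity passed to the projection is a hypothetical placeholder, since the Table 2 values are not reproduced in this text.

```python
def same_period_bound(n_written, working_set):
    """Equation (13): fraction of writes overwritten within the trace period."""
    return 1.0 - working_set / n_written


def projected_bound(n_written, working_set, capacity, k):
    """Equation (17): lower bound for an observation period of k trace periods."""
    capacity_bound = 1.0 - capacity / (k * n_written)  # Equation (16)
    return max(capacity_bound, same_period_bound(n_written, working_set))


# hd2 numbers quoted in the text: 726 GB written in 1 day, 313 GB working set.
s_day = same_period_bound(726, 313)
print(f"1-day bound: {s_day:.0%}")

# ASSUMPTION: 700 GB is a placeholder capacity chosen only for illustration.
s_week = projected_bound(726, 313, capacity=700, k=7)
print(f"1-week bound with the assumed capacity: {s_week:.0%}")
```

Because Equation (17) takes the maximum of the two bounds, the projected value can never fall below the single-period value, whatever capacity is assumed.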


5 System Design 


5.1 Retention-Aware FTL (Flash Transla- 
tion Layer) 


In this section, we present the SSD design which leverages
retention relaxation for improving either write
speed or ECCs’ cost and performance. Specifically, in
the proposed SSD, data can be written into NAND Flash
memories with variable write latencies or be encoded
by different ECC codes, which provide different
levels of retention guarantees. We refer to data written
by these different methods as being in different “modes”.
In our design, all data in a physical NAND Flash block
are in the same mode. To correctly retrieve data from
NAND Flash, we need to record the mode of each physical
block. Furthermore, to avoid data losses due to a
shortage of data retention capability, we have to monitor
the remaining retention capability of each NAND Flash
block. We implement the proposed retention-aware design
in the Flash Translation Layer (FTL) in SSDs rather
than in OSes. An FTL-based implementation requires minimal
OS/application modification, which we think is important
for easy deployment and wide adoption of the
proposed scheme.

Table 2: Data retention requirements analysis

1 The disk capacity of the MSRC traces is estimated using their
maximum R/W address. The estimation results conform to the previous
study [30].

2 1d, 1w, and 5w stand for a day, a week, and 5 weeks, respectively.

3 GB stands for 2^30 bytes.


Figure 10 shows the block diagram of the proposed 
FTL. The proposed FTL is based on the page-level 
FTL [5] with two additional components, Mode Selec- 
tor (MS) and Retention Tracker (RT). For writes, MS 
sends different write commands to NAND Flash chips 
or invokes different ECC encoders. As discussed in Section
3.2.1, write speed could be improved by adopting a
larger $\Delta V_P$. In current Flash chips, only one write
command is supported. To support the proposed mechanism,
NAND Flash chips need to provide multiple write commands
with different $\Delta V_P$ values. MS keeps the mode
of each NAND Flash block in memory so that during
reads, it can invoke the right ECC decoder to retrieve 
data. RT is responsible for ensuring that every NAND 
Flash block in the SSD does not run out of its retention 
capability. RT uses one counter per NAND Flash block 
to keep track of its remaining retention time. When the 
first page of a block is written, the retention capability of 
this write is stored in the counter. These retention coun- 
ters are periodically updated. If a block is found to ap- 
proach its data retention limit, RT schedules background 
operations to move valid data in this block to another new 
block and then invalidates the old one. 
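The per-block counter logic of the Retention Tracker can be sketched as follows. This is an illustrative model under stated assumptions, not the authors' firmware: the class name, method names, and the convention of counting retention in whole checking periods are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RetentionTracker:
    """Tracks remaining retention per block, in units of checking periods."""
    remaining: dict = field(default_factory=dict)  # block id -> periods left

    def on_first_page_write(self, block, retention_periods):
        # The retention capability of the first write is stored in the counter.
        self.remaining[block] = retention_periods

    def on_block_invalidated(self, block):
        # Called once valid data has been moved and the old block invalidated.
        self.remaining.pop(block, None)

    def tick(self, threshold=1):
        """Periodic update; returns blocks approaching their retention limit."""
        due = []
        for block in list(self.remaining):
            self.remaining[block] -= 1
            if self.remaining[block] <= threshold:
                due.append(block)
        return due


# Example: a block whose write is good for two checking periods.
rt = RetentionTracker()
rt.on_first_page_write(block=7, retention_periods=2)
blocks_to_move = rt.tick()  # block 7 now has one period left
for b in blocks_to_move:
    rt.on_block_invalidated(b)  # after background data movement completes
```

Returning the list of due blocks from `tick` keeps the scheduling policy (how quickly the background moves are issued) separate from the bookkeeping.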


Figure 10: Proposed retention-aware FTL

One main parameter in the proposed SSD design is
how many write modes we should employ in the SSD.
The optimal setting depends on retention time variation
in workloads and the cost for supporting multiple
write modes. In this work, we present a coarse-grained 
management method. There are two kinds of NAND 
Flash writes in SSD systems: host writes and background 
writes. Host writes correspond to write requests sent 
from the host to the SSDs; background writes comprise 
cleaning, wear-leveling, and data movement internal to 
the SSDs. Performance is usually important to the host 
writes. Moreover, host writes usually require short data 
retention as shown in Section 4. In contrast, background 
writes are less sensitive to performance and usually in- 
volve data which have been stored in the storage for a 
long time; therefore, their data are expected to remain 
for a long time in the future (commonly referred to as 
cold data). Based on this observation, we propose to em- 
ploy two levels of retention guarantees for the two kinds 
of writes. For host writes, retention-relaxed writes are 
used to exploit their high probability of short retention 
requirements and gain performance benefits; for back- 
ground writes, normal writes are employed to preserve 
the retention guarantee. 
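The two-level policy amounts to a mapping from the kind of write to the write mode, plus a record of the mode for each physical block. A minimal sketch, with hypothetical names that do not come from the paper:

```python
from enum import Enum


class WriteKind(Enum):
    HOST = "host"              # write requests sent from the host
    BACKGROUND = "background"  # cleaning, wear-leveling, internal data movement


class Mode(Enum):
    RELAXED = "retention-relaxed"
    NORMAL = "normal"


def select_mode(kind):
    """Host writes use the relaxed mode; background writes keep full retention."""
    return Mode.RELAXED if kind is WriteKind.HOST else Mode.NORMAL


# All data in a physical block share one mode, so one entry per block suffices.
block_mode = {}
block_mode[42] = select_mode(WriteKind.HOST)
block_mode[43] = select_mode(WriteKind.BACKGROUND)
```

The recorded mode is what a read path would consult to pick the matching decoder.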


In the proposed two-level framework, to optimize
write performance, host writes are performed at a fast write
speed with reduced retention capability. If data are not
overwritten within their retention guarantee, background
writes with normal write speed are issued. To optimize
ECCs’ cost and performance, a new ECC architecture 
is proposed. As mentioned earlier, NAND Flash RBER 
will soon exceed BCH’s limitation (i.e., RBER > 107%); 
therefore, advanced ECC designs will be required for fu- 
ture SSDs. Figure 11 shows such an advanced ECC de- 
sign for future SSDs which employs multi-layer ECCs 
with code concatenations: the inner code is BCH, and 
the outer code is LDPC. Concatenating BCH and LDPC 
exploits the advantages of both [43]: LDPC greatly im- 
proves the maximum correcting capability, while BCH 
complements LDPC for eliminating LDPC’s error floor. 
The main issue with this design is that, since every write
needs to be encoded in LDPC, a high-throughput LDPC
encoder is required to prevent the LDPC encoder from
being the bottleneck.

Figure 11: Baseline concatenated BCH-LDPC codes in an SSD

In the proposed ECC architecture
shown in Figure 12, host writes are protected by BCH 
only since they tend to have short retention requirements. 
If data are not overwritten within the retention guaran- 
tee provided by BCH, background writes are issued. All 
background writes are protected by LDPC. In this way, 
the LDPC encoder is kept out of the critical performance 
path. Its benefits are two-fold. First, write performance 
is improved since host writes do not go through time- 
consuming LDPC encoding. Second, since BCH filters 
out short-lifetime data and LDPC encoding can be amor- 
tized in the background, the throughput requirements of 
LDPC are less than the baseline design. Therefore, the 
LDPC hardware cost can be reduced. 

We present two specific implementations of retention
relaxation. The first one relaxes the retention capability
of host writes to 10 weeks and periodically checks
the remaining retention capability of each NAND Flash
block at the end of every 5 weeks. Therefore, the FTL
always has at least another 5 weeks to reprogram those
data which have not been overwritten in the past period
and can amortize the reprogramming task in the background
over the 5 weeks without causing burst writes.
We set the period of invoking the reprogramming tasks
to 100 ms. The second one is similar to the first one except
that the retention capability and checking period are
2 weeks and 1 week, respectively. These two designs are
referred to as RR-10week and RR-2week in this paper.


5.2 Overhead Analysis 
Memory Overhead 


The proposed mechanism requires extra memory resources
to store write modes and retention time information
for each block. Since we only have two write
modes, i.e., the normal mode and the retention-relaxed
one, each block requires only a 1-bit flag to record its
write mode. As for the size of the counter for keeping
track of the remaining retention time, both RR-2week
and RR-10week require only a 1-bit counter per block
because all retention-relaxed blocks written in the n-th
period are reprogrammed during the (n+1)-th period. For
an SSD having 128 GB of NAND Flash with a 2 MB block
size, the memory overhead is 16 KB.

Figure 12: Proposed ECC architecture leveraging retention
relaxation in an SSD

Figure 15: Wear-out overhead of retention relaxation
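The 16 KB memory overhead follows from two bits per block (one mode flag plus one counter bit). A short arithmetic check, with an illustrative helper name:

```python
GIB = 2**30
MIB = 2**20


def tracking_overhead_bytes(capacity_bytes, block_bytes, bits_per_block=2):
    """Memory for a 1-bit mode flag and a 1-bit counter per block."""
    blocks = capacity_bytes // block_bytes
    return blocks * bits_per_block // 8


overhead_bytes = tracking_overhead_bytes(128 * GIB, 2 * MIB)
print(overhead_bytes // 1024, "KB")  # 65,536 blocks * 2 bits = 16 KB
```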


Reprogramming Overhead 


In the proposed schemes, data that are not overwritten
within the guaranteed retention time need to be reprogrammed.
These extra writes affect both the performance
and the lifetime of SSDs. To analyze the performance
impact, we estimate reprogramming amounts per unit of
time based on the projection method described in Section
4.2. Here, we let $T_2$ be the checking period in the
proposed schemes. For example, for RR-10week, $T_2$
equals 5 weeks. Therefore, at the end of each period, the
total write amount is kN, the percentage of writes which
require reprogramming is at most $(1 - S_{T_2,T_2}\%)$, and the
reprogramming tasks can be amortized over the upcoming
period of $T_2$. The reprogramming amounts per unit
of time are as follows:

$$\frac{(1 - S_{T_2,T_2}\%) \times k \times N}{T_2} \tag{18}$$

The results show that the amount of reprogramming
ranges from 1.13 kB/s to 1.25 MB/s for RR-2week,
and from 1.13 kB/s to 0.26 MB/s for RR-10week.
Since each NAND Flash plane can provide
6.2 MB/s write throughput (i.e., writing an 8 kB page in
1.3 ms), we anticipate that reprogramming does not lead
to high performance overhead. In Section 6, we evaluate
its actual performance impact.
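Equation (18) can be evaluated directly. The sketch below is illustrative only: the function name is hypothetical, and the inputs reuse the hd2 figures quoted in Section 4.2 (726 GB written per day and the 86% one-week bound) rather than values taken from the evaluation.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def reprogramming_rate(s_t2_t2, k, n_per_period, t2_seconds):
    """Equation (18): upper bound on reprogrammed data per second.

    s_t2_t2      -- lower bound on S_{T2,T2}% expressed as a fraction (0..1)
    k            -- number of trace periods T1 in one checking period T2
    n_per_period -- write amount N per trace period (any data unit)
    t2_seconds   -- length of the checking period T2 in seconds
    """
    return (1.0 - s_t2_t2) * k * n_per_period / t2_seconds


# hd2 with a 1-week checking period, as in RR-2week: N = 726 GB/day, k = 7.
rate_gb_per_s = reprogramming_rate(0.86, 7, 726, 7 * SECONDS_PER_DAY)
rate_mb_per_s = rate_gb_per_s * 1024  # GB = 2**30 bytes, MB = 2**20 bytes
print(f"{rate_mb_per_s:.2f} MB/s")
```

With these inputs the estimate is about 1.2 MB/s, well below the 6.2 MB/s a single plane can absorb.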
Figure 13: SSD write response time speedup

To quantify the wear-out effect caused by reprogramming,
we show extra writes per cell per year assuming
perfect wear-leveling. We first give the upper bound on
this metric. Let's take RR-2week for example. In the extreme
case, RR-2week reprograms the entire disk every
week, which leads to 52.1 extra writes per cell per year.
Similarly, RR-10week causes 10.4 extra writes per cell
per year at most. These extra writes are not significant
compared to NAND Flash's endurance, which is usually
a few thousand P/E cycles. Therefore, even in the worst
case, the proposed mechanism does not cause a significant
wear-out effect. For real workloads, the wear-out overhead
is usually smaller than the worst case as shown in
Figure 15. The wear-out overhead for each workload is
evaluated based on the disk capacity and the reprogramming
amounts per unit of time presented above.
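The worst-case figures are one full-disk rewrite per checking period. A quick check of the arithmetic, with a hypothetical endurance of 3,000 P/E cycles standing in for "a few thousand":

```python
def worst_case_extra_writes_per_year(check_period_days, days_per_year=365):
    """One full-disk reprogram per checking period, perfect wear-leveling."""
    return days_per_year / check_period_days


rr_2week = worst_case_extra_writes_per_year(7)    # checked every week
rr_10week = worst_case_extra_writes_per_year(35)  # checked every 5 weeks
print(round(rr_2week, 1), round(rr_10week, 1))

# ASSUMPTION: 3,000 P/E cycles is an illustrative endurance value.
ASSUMED_ENDURANCE = 3000
print(f"{rr_2week / ASSUMED_ENDURANCE:.1%} of endurance per year (RR-2week)")
```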


6 System Evaluation 


We conduct simulation-based experiments using
SSDsim [5] and Disksim-4.0 [10] to evaluate the RR-10week
and RR-2week designs. SSDs are configured to
have 16 channels. Detailed configurations and parameters
are listed in Table 3. Eleven of the 16 traces listed in
Table 2 are used and simulated for the whole trace. We
omit prxy_1 because the simulated SSD cannot sustain
its load, and prn_1, src1_1, usr_1, and usr_2 are also omitted
because they contain write amounts less than 15% of the
total raw NAND Flash capacity. SSD write speedup and
ECCs’ cost and performance improvement are evaluated
separately. The reprogramming overhead described in
Section 5.2 is considered in the experiments.

Figure 13 shows the speedup of write response time
for different workloads if we leverage retention relaxation
to improve write speed. We can see that RR-10week
and RR-2week typically achieve 1.8-2.6x write
response time speedup. hd1 and hd2 show up to
3.9-5.7x speedup. These two workloads have high
queuing delay due to high I/O throughput. With retention
relaxation, the queuing time is greatly reduced, by
3.7x to 6.1x. Moreover, for all workloads, RR-2week
gives about 20% extra performance gain over RR-10week.
Figure 14 shows the speedup in terms of overall
response time. The overall response time is mainly determined
by write requests due to the significant amount of
write requests in the tested workloads and the long write
latency. Therefore, we can see that the speedup trend is
similar to that of write response time.

Figure 14: SSD overall response time speedup

Table 3: NAND Flash and SSD configurations

Parameter                Value
Over-provisioning        15%
Cleaning threshold       5%
Page size                8 KB
Pages per block          256
Blocks per plane         2000
Planes per die           2
Dies per channel         1~8
Number of channels       16
Mapping policy           Full stripe
Page write latency       1.3 ms
Block erase latency      3.8 ms
NAND bus bandwidth       200 MB/s

Trace Name                        Dies per Disk   Exported Capacity (GB)
prn_0, proj_0, prxy_0, src1_2     16              106
src2_2                            32              212
src1_0                            64              423
proj_2, hd1, hd2, tpcc1, tpcc2    128             847

To show how retention relaxation benefits ECC design 
in future SSDs, we consider SSDs comprising NAND 
Flash whose 1-year RBER approaches 2.2 x 1077. We 
compare the proposed RR-2week design with the base- 
line design which employs concatenated BCH-LDPC 
codes. The LDPC encoder is modeled as a FIFO and its
throughput is chosen among 5, 10, 20, 40, 80, 160, 320,
and ∞ MB/s. Since the I/O queue of the simulated SSDs
could saturate if LDPC's throughput is insufficient, we
first report the minimum required throughput configurations
that do not cause saturation in Figure 16. As can be
seen, for the baseline ECC architecture, throughput up
to 160 MB/s is required. In contrast, for RR-2week, the
lowest throughput configuration (i.e., 5 MB/s) is enough
to sustain the write rates in all tested workloads. Figure 17
shows the response time of the baseline and RR-2week
under various LDPC throughput configurations.
The response time reported in this figure is the average
of the response time normalized to that with unlimited
LDPC throughput:

$$\frac{1}{N} \sum_{i=1}^{N} \left( \frac{ResponseTime_i}{IdealResponseTime_i} \right) \tag{19}$$

where N is the number of workloads which do not incur
I/O queue saturation given a specific LDPC throughput,
and $IdealResponseTime_i$ denotes the response time of
workload i with unlimited LDPC throughput. In the figure,
the curve of the baseline presents
a zigzag appearance between 5 MB/s and 80 MB/s because
several traces are excluded due to the saturation
in the I/O queue. This may inflate the performance of the
baseline. Even so, we see RR-2week outperforms the
baseline significantly with the same LDPC throughput
configuration. For example, with 10 MB/s throughput,
RR-2week performs 43% better than the baseline. Only
when the LDPC throughput approaches infinity does RR-2week
perform a bit worse than the baseline due to
reprogramming overhead. We can also see that with a
20 MB/s LDPC, RR-2week already approaches the performance
of unlimited LDPC throughput, while the baseline
requires 160 MB/s to achieve a similar level. Because
hardware parallelization is one basic approach to
increase the throughput of an LDPC encoder [24], from this
point of view, retention relaxation can reduce the hardware
cost of LDPC encoders by 8x.

Figure 16: Minimum required LDPC throughput
configurations without causing I/O queue saturation
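The averaging in Equation (19), including the exclusion of saturated workloads, can be sketched as below. The function name, the use of `None` to mark saturation, and the sample numbers are all illustrative assumptions, not data from the evaluation.

```python
def average_normalized_response_time(measured, unlimited):
    """Equation (19): mean of per-workload response times normalized to the
    unlimited-throughput case, over workloads that did not saturate.

    measured  -- workload name -> response time, or None if the queue saturated
    unlimited -- workload name -> response time with unlimited LDPC throughput
    """
    ratios = [measured[w] / unlimited[w]
              for w in measured if measured[w] is not None]
    if not ratios:
        raise ValueError("every workload saturated at this throughput")
    return sum(ratios) / len(ratios)


# Made-up example: workload "b" saturates and is excluded from the average.
measured = {"a": 2.0, "b": None, "c": 3.0}
unlimited = {"a": 1.0, "b": 1.0, "c": 2.0}
print(average_normalized_response_time(measured, unlimited))
```

Because N changes as workloads drop out, averages at different throughput settings are taken over different workload sets, which is the source of the zigzag noted for the baseline curve.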


7 Related Work 


Access frequencies are usually considered in storage op- 
timization. Chiang et al. [12] propose to cluster data 
with similar write frequencies together to increase SSDs’ 
cleaning efficiency. Pritchett and Thottethodi [33] ob- 
serve the skewness of disk access frequencies in datacen- 
ters and propose novel ensemble-level SSD-based disk 
caches. In contrast, we focus on the time interval be- 
tween two successive writes to the same address which 
defines the data retention requirement. 

Several device-aware optimizations for NAND Flash-based
SSDs were proposed recently. Grupp et al. [17]
exploit the variation in page write speed in MLC NAND
Flash to improve SSDs’ responsiveness. Tanakamaru et
al. [40] propose wear-out-aware ECC schemes to improve
the ECC capability. Xie et al. [42] improve write
speed through compressing user data and employing
stronger ECC codes. Pan et al. [31] improve write speed
and defect tolerance using wear-out-aware policies. Our
work considers the retention requirements of real workloads
and relaxes NAND Flash's data retention to optimize
SSDs, which is orthogonal to the above device-aware
optimizations.

Figure 17: Average normalized response time given
various LDPC throughput configurations

Smullen et al. [36] and Sun et al. [38] improve energy 
and latency of STTRAM-based CPU caches through re- 
designing STTRAM cells with relaxed non-volatility. In 
contrast, we focus on NAND Flash memories used in 
storage systems. 


8 Conclusions


We present the first work on optimizing SSDs via re- 
laxing NAND Flash’s data retention capability. We de- 
velop a NAND Flash model to evaluate the benefits if 
NAND Flash’s original multi-year data retention can 
be reduced. We also demonstrate that in real systems, 
write requests usually require days or even shorter re- 
tention times. To optimize the write speed and ECCs’ 
cost and performance, we design SSD systems which 
handle host writes with shortened retention time while 
handling background writes as usual and present corre- 
sponding retention tracking schemes to guarantee that no 
data loss happens due to a shortage of retention capa- 
bility. Simulation results show that the proposed SSDs 
achieve 1.8-5.7x write response time speedup. We also
show that for future SSDs, retention relaxation can bring 
both performance and cost benefits to the ECC architec- 
ture. We leave simultaneously optimizing write speed 
and ECCs as our future work. 
Abstract


Over the last decade we have witnessed the relent- 
less technological improvement in flash-based solid- 
state drives (SSDs) and they have many advantages over 
hard disk drives (HDDs) as a secondary storage such as 
performance and power consumption. However, the ran- 
dom write performance in SSDs still remains as a con- 
cern. Even in modern SSDs, the disparity between ran- 
dom and sequential write bandwidth is more than ten- 
fold. Moreover, random writes can shorten the limited 
lifespan of SSDs because they incur more NAND block 
erases per write. 

In order to overcome these problems due to random 
writes, in this paper, we propose a new file system 
for SSDs, SFS. First, SFS exploits the maximum write 
bandwidth of SSD by taking a log-structured approach. 
SFS transforms all random writes at file system level to
sequential ones at SSD level. Second, SFS takes a new 
data grouping strategy on writing, instead of the existing 
data separation strategy on segment cleaning. It puts the 
data blocks with similar update likelihood into the same 
segment. This minimizes the inevitable segment clean- 
ing overhead in any log-structured file system by allow- 
ing the segments to form a sharp bimodal distribution of 
segment utilization. 

We have implemented a prototype SFS by modifying 
Linux-based NILFS2 and compared it with three state- 
of-the-art file systems using several realistic workloads. 
SFS outperforms the traditional LFS by up to 2.5 times 
in terms of throughput. Additionally, in comparison to 
modern file systems such as ext4 and btrfs, it drastically
reduces the block erase count inside the SSD by up to 
7.5 times. 


1 Introduction 


NAND flash memory based SSDs have been revolution- 
izing the storage system. An SSD is a purely electronic 
device with no mechanical parts, and thus can provide 
lower access latencies, lower power consumption, lack 
of noise, shock resistance, and potentially uniform ran- 
dom access speed. However, there remain two serious 
problems limiting wider deployment of SSDs: limited 
lifespan and relatively poor random write performance. 


,wonlee®,yieom® } @ece.skku.ac.kr, hj1120.cho° @samsung.com 


The limited lifespan of SSDs remains a critical concern 
in reliability-sensitive environments, such as data cen- 
ters [5]. Even worse, the ever-increased bit density for 
higher capacity in NAND flash memory chips has re- 
sulted in a sharp drop in the number of program/erase 
cycles from 10K to 5K for the last two years [4]. Mean- 
while, previous work [12, 9] shows that random writes 
can cause internal fragmentation of SSDs and thus lead 
to performance degradation by an order of magnitude. In 
contrast to HDDs, the performance degradation in SSDs 
caused by the fragmentation lasts for a while after ran- 
dom writes are stopped. The reason for this is that ran- 
dom writes cause the data pages in NAND flash blocks 
to be copied elsewhere and erased. Therefore, the lifes- 
pan of an SSD can be drastically reduced by random 
writes. 


Not surprisingly, researchers have devoted much effort
to resolving these problems. Most of the work has been
focused on a flash translation layer (FTL) — an SSD 
firmware emulating an HDD by hiding the complex- 
ity of NAND flash memory. Some studies [24, 14] im- 
proved random write performance by providing more ef- 
ficient logical to physical address mapping. Meanwhile, 
other studies [22, 14] propose a separation of hot/cold 
data to improve random write performance. However, 
such under-the-hood optimizations are purely based on 
logical block addresses (LBA) requested by a file sys- 
tem so that they would become much less effective for 
the no-overwrite file systems [16, 48, 10] in which ev- 
ery write to the same file block is always redirected to 
a new LBA. There are other attempts to improve ran- 
dom write performance especially for database systems 
[23, 39]. Each attempt proposes a new database stor- 
age scheme, taking into account the performance char- 
acteristics of SSDs. However, despite the fact that these 
flash-conscious techniques are quite effective in specific 
applications, they cannot provide the benefit of such op- 
timization to general applications. 


In this paper, we propose a novel file system, SFS, that 
can improve random write performance and extend the 
lifetime of SSDs. Our work is motivated by LFS [32], 
which writes all modifications to disk sequentially in a 
log-like structure. In LFS, the segment cleaning over- 
head can severely degrade performance [35, 36] and 
shorten the lifespan of an SSD. This is because quite
a high number of pages need to be copied to secure a 
large empty chunk for a sequential write at every seg- 
ment cleaning. In designing SFS, we investigate how to 
take advantage of performance characteristics of SSD 
and I/O workload skewness to reduce the segment clean- 
ing overhead. 


This paper makes the following specific contributions: 


• We introduce the design principles for SSD-based
file systems. The file system should exploit the
performance characteristics of SSD and directly utilize
file block level statistics. In fact, the architectural
differences between SSD and HDD result in different
performance characteristics for each system.
One interesting example is that, in SSD, the addi- 
tional overhead of random write disappears only 
when the unit size of random write requests be- 
comes a multiple of a certain size. To this end, we 
take log-structured approach with a carefully se- 
lected segment size. 

• To reduce the segment cleaning overhead in the
log-structured approach, we propose an eager on-writing
data grouping scheme that classifies file
blocks according to their update likelihood and 
writes those with similar update likelihoods into the 
same segment. The effectiveness of data grouping 
is determined by proper selection of the grouping 
criteria. For this, we propose an iterative segment 
quantization algorithm to determine the grouping 
criteria, while considering disk-wide hotness dis- 
tribution. We also propose cost-hotness policy for 
victim segment selection. Our eager data grouping 
will colocate frequently updated blocks in the same 
segments; thus most blocks in those segments are 
expected to become rapidly invalid. Consequently, 
the segment cleaner can easily find a victim seg- 
ment with few live blocks and thus can minimize 
the overhead of copying the live blocks. 

• Using a number of realistic and synthetic workloads,
we show that SFS significantly outperforms
LFS and state-of-the-art file systems such as ext4
and btrfs. We also show that SFS can extend the 
lifespan of an SSD by drastically reducing the num- 
ber of NAND flash block erases. In particular, while 
the random write performance of the existing file 
systems is highly dependent on the random write 
performance of SSD, SFS can achieve nearly max- 
imum sequential write bandwidth of SSD for ran- 
dom writes at the file system level. Therefore, SFS 
can provide high performance even on mid-range 
or low-end SSDs as long as their sequential write 
performance is comparable to high-end SSDs. 


The rest of this paper is organized as follows. Section 2
overviews the characteristics of SSD and I/O
workloads. Section 3 describes the design of SFS in 
detail, and Section 4 shows the extensive evaluation of 
SFS. Related work is described in Section 5. Finally, in 
Section 6, we conclude the paper. 


2 Background 


2.1 Flash Memory and SSD Internals 


NAND flash memory is the basic building block of 
SSDs. Read and write operations are performed at the 
granularity of a page (e.g. 2 KB or 4 KB), and the 
erase operation is performed at the granularity of a block 
(composed of 64-128 pages). NAND flash memory differs
from HDDs in several aspects: (1) asymmetric speed
of read and write operations, (2) no in-place overwrite — 
the whole block must be erased before overwriting any 
page in that block, and (3) limited program/erase cycles 
— a single-level cell (SLC) has roughly 100K erase cy- 
cles and a typical multi-level cell (MLC) has roughly 
10K erase cycles. 

A typical SSD is composed of host interface logic 
(SATA, USB, and PCI Express), an array of NAND flash 
memories, and an SSD controller. A flash translation 
layer (FTL) run by an SSD controller emulates an HDD 
by exposing a linear array of logical block addresses 
(LBAs) to the host. To hide the unique characteristics 
of flash memory, it carries out three main functions: (1) 
managing a mapping table from LBAs to physical block 
addresses (PBAs), (2) performing garbage collection to 
recycle invalidated physical pages, and (3) wear-leveling 
to wear out flash blocks evenly in order to extend the 
SSD’s lifespan. Agrawal et al. [2] comprehensively de- 
scribe the broad design space and tradeoffs of SSD. 

Much research has been carried out on FTL to improve
performance and extend the lifetime [18, 24, 22,
14]. In a block-level FTL scheme, a logical block num- 
ber is translated to a physical block number and the log- 
ical page offset within a block is fixed. Since the map- 
ping in this instance is coarse-grained, the mapping ta- 
ble is small enough to be kept in memory entirely. Un- 
fortunately, this results in a higher garbage collection 
overhead. In contrast, since a page-level FTL scheme 
manages a fine-grained page-level mapping table, it re- 
sults in a lower garbage collection overhead. However, 
such fine-grained mapping requires a large mapping ta- 
ble on RAM. To overcome such technical difficulties, 
hybrid FTL schemes [18, 24, 22] extend the block-level 
FTL. These schemes logically partition flash blocks into 
data blocks and log blocks. The majority of data blocks 
are mapped using block level mapping to reduce the re- 
quired RAM size and log blocks are mapped using page- 
level mapping to reduce the garbage collection overhead. 
Similarly, DFTL [14] extends the page-level mapping by 




                                SSD-H     SSD-M      SSD-L
Manufacturer                    Intel     Samsung    Transcend
Model                           X25-E     S470       JetFlash 700
Capacity
Interface                       SATA      SATA       USB 3.0
Flash Memory                    SLC       MLC        MLC
Max Sequential Reads (MB/s)     216.9     212.6      69.1
Random 4 KB Reads (MB/s)        13.8      10.6       5.3
Max Sequential Writes (MB/s)    170       87         38
Random 4 KB Writes (MB/s)                            0.002
Price ($/GB)

Table 1: Specification data of the flash devices. List price
is as of September 2011.


selectively caching page-level mapping table entries on 
RAM. 
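The difference between block-level and page-level mapping described above can be made concrete with a small sketch. The names, the dictionary-based tables, and the choice of 64 pages per block are illustrative assumptions rather than any particular FTL's implementation.

```python
PAGES_PER_BLOCK = 64  # ASSUMPTION: lower end of the 64-128 range in Section 2.1


def block_level_translate(lpn, block_map):
    """Block-level FTL: translate the block number, keep the page offset fixed."""
    lbn, offset = divmod(lpn, PAGES_PER_BLOCK)
    return block_map[lbn] * PAGES_PER_BLOCK + offset


def page_level_translate(lpn, page_map):
    """Page-level FTL: every logical page has its own mapping entry."""
    return page_map[lpn]


def mapping_entries(total_pages, page_level):
    """Number of table entries each scheme needs for the same capacity."""
    return total_pages if page_level else total_pages // PAGES_PER_BLOCK


block_map = {0: 5}          # logical block 0 lives in physical block 5
page_map = {3: 900}         # logical page 3 can live at any physical page
print(block_level_translate(3, block_map))   # physical page 5 * 64 + 3
print(page_level_translate(3, page_map))
print(mapping_entries(1_048_576, page_level=False),
      mapping_entries(1_048_576, page_level=True))
```

The entry counts show why the coarse-grained table fits in memory while the fine-grained one is large: the page-level table needs one entry per page instead of one per block.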


2.2 Imbalance between Random and Se- 
quential Write Performance in SSDs 


Most SSDs are built on an array of NAND flash memo- 
ries, which are connected to the SSD controller via mul- 
tiple channels. To exploit this inherent parallelism for 
better I/O bandwidth, SSDs perform read or write op- 
erations as a unit of a clustered page [19] that is com- 
posed of physical pages striped over multiple NAND 
flash memories. If the request size is not a multiple of 
the clustered page size, extra read or write operations 
are performed in the SSD and the performance is de- 
graded. Thus, the clustered page size is critical in I/O 
performance. 
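The alignment condition can be expressed as a pair of small helpers. This is a sketch under assumptions: the 32 KB clustered page size is a made-up value, and the helpers only count which clustered pages a request touches; they do not model any specific SSD.

```python
def clustered_pages_touched(offset, length, clustered_page):
    """Number of clustered pages overlapped by a request (all sizes in bytes)."""
    first = offset // clustered_page
    last = (offset + length - 1) // clustered_page
    return last - first + 1


def is_aligned(offset, length, clustered_page):
    """True when the request covers whole clustered pages only."""
    return offset % clustered_page == 0 and length % clustered_page == 0


CP = 32 * 1024  # ASSUMPTION: illustrative clustered page size

# An aligned 32 KB write fills exactly one clustered page.
print(clustered_pages_touched(0, CP, CP), is_aligned(0, CP, CP))

# The same 32 KB shifted by 16 KB straddles two clustered pages, both of
# which are only partially covered and so need extra read or write work.
print(clustered_pages_touched(16 * 1024, CP, CP), is_aligned(16 * 1024, CP, CP))
```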

Write performance in SSDs is highly workload dependent
and is eventually limited by the garbage collection
performance of FTL. Previous work [12, 9, 39, 37, 38] 
has reported that random write performance drops by 
more than an order of magnitude after extensive random 
updates and returns to the initial high performance only 
after extensive sequential writes. The reason for this is 
that random writes increase the garbage collection over- 
head in FTL. In a hybrid FTL, random writes increase 
the associativity between log blocks and data blocks, and 
incur the costly full merge [24]. In page-level FTL, as it 
tends to fragment blocks evenly, garbage collection has 
large copying overhead. 

In order to improve garbage collection performance, 
SSD combines several blocks striped over multiple 
NAND flash memories into a clustered block [19]. The 
purpose of this is to erase multiple physical blocks in 
parallel. If all write requests are aligned in multiples of 
the clustered block size and their sizes are also multiples 
of the clustered block size, random write updates and in- 
validates a clustered block as a whole. Therefore, in hy- 
brid FTL, a switch merge [24] with the lowest overhead 
occurs. Similarly, in page-level FTL, empty blocks with 
no live pages are selected as victims for garbage collec- 
tion. The result of this is that random write performance 
converges with sequential write performance. To verify
this, we measured sequential write and random write
throughput on three different SSDs in Table 1, ranging
from a high-end SLC SSD (SSD-H) to a low-end USB
memory stick (SSD-L). To determine sustained write
performance, dummy data equal to twice the device’s
capacity is first written for aging, and the throughput of
subsequent writing for 8 GB is measured. Figure 1 shows
that random write performance catches up with sequential
write performance when the request size is 16 MB or
32 MB. These unique performance characteristics motivate
the second design principle of SFS: write bandwidth
maximization by sequential writes to SSD.

Figure 1: Sequential vs. random write throughput.
(Throughput in MB/s versus request size; sequential and
random writes on SSD-H, SSD-M, and SSD-L.)

Figure 2: Cumulative write frequency distribution.


2.3 Skewness in I/O Workloads 


Many researchers have pointed out that I/O workloads 
have non-uniform access frequency distribution [34, 31, 
23, 6, 3, 33, 11]. A disk-level trace of personal work- 
stations at Hewlett Packard laboratories exhibits a high 
locality of references in that 90% of the writes go to
1% of the blocks [34]. Roselli et al. [31] analyzed file
system traces collected from four different groups of
machines: an instructional laboratory, a set of computers
used for research, a single web server, and a set of PCs 
running Windows NT. They found that files tend to be 
either read-mostly or write-mostly and the writes show 
substantial locality. Lee and Moon [23] showed that the 
update frequency of TPC-C workloads is highly skewed, 
in that 29% of writes go to 1.6% of pages.
Bhadkamkar et al. [6] collected and investigated I/O
traces of office and developer desktop workloads, a ver- 
sion control server, and a web server. Their analysis con- 
firms that the top 20% most frequently accessed blocks 
contribute to a substantially large (45-66%) percentage 
of total access. Moreover, high and low frequency blocks 
are spread over the entire disk area in most cases. Fig- 
ure 2 depicts the cumulative write frequency distribution 
of three real workloads: an I/O trace collected by ourselves
while running TPC-C [40] using Oracle DBMS
(TPC-C), a research group trace (RES), and a web server
trace equipped with Postgres DBMS (WEB) collected
by Roselli et al. [31]. This observation motivates the third
design principle of SFS: block grouping according to 
write frequency. 


3 Design of SFS 


SFS is motivated by a simple question: How can we 
utilize the performance characteristics of SSD and the 
skewness of the I/O workload in designing an SSD-based 
file system? In this section, we describe the rationale be- 
hind the design decisions in SFS, its system architecture, 
and several key techniques including hotness measure, 
segment quantization, segment writing, segment clean- 
ing and victim selection policy, and crash recovery. 


3.1 SFS: Design for SSD-based File Systems of the 2010s


Historically, existing file systems and modern SSDs 
have evolved separately without consideration of each 
other. With the exception of the recently introduced 
TRIM command, the two layers communicate with each 
other through simple read and write operations using 
only LBA information. For this reason, there are many 
impedance mismatches between the two layers, thus hin- 
dering the optimal performance when both layers are 
simply used together. In this section, we explain three 
design principles of SFS. First, we identify general per- 
formance problems when the existing file systems are 
running on modern SSDs and suggest that a file system 
should exploit the file block semantics directly. Second, 
we propose to take a log-structured approach based on
the observation that the random write bandwidth is much
lower than the sequential one. Third, we argue that
the existing lazy data grouping in LFS during segment
cleaning fails to fully utilize the skewness in write
patterns and that an eager data grouping is necessary
to achieve sharper bimodality in segments. In the following,
we describe each principle in detail.

File block level statistics — Beyond LBA: The exist- 
ing file systems perform suboptimally when running on 
top of SSDs with current FTL technology. This subopti- 
mal performance can be attributed to poor random write 
performance in SSDs. One of the basic functionalities of 
file systems is to allocate an LBA for a file block. With
regard to this LBA allocation, there have been two general
policies in the file system community: in-place-update
and no-overwrite. The in-place-update file systems such
as FAT32 [27] and ext4 [25] always overwrite a dirty file
block to the same LBA so that the same LBA always
corresponds to a file block unless the file frees the block.
This unwritten assumption in file systems, together with 
the LBA level interface between file systems and storage 
devices, allows the underlying FTL mechanism in SSDs 
to exploit the overwrites against the same LBA address. 
In fact, most FTL research [24, 22, 13, 14] has focused 
on improving the random write performance based on 
the LBA level write patterns. Despite the relentless im- 
provement in FTL technology, the random write band- 
width in modern SSDs, as presented in Section 2.2, still 
lags far behind the sequential one. 

Meanwhile, several no-overwrite file systems have
been implemented, such as btrfs [10], ZFS [48], and
WAFL [16], where dirty file blocks are written to new
LBAs. These systems are known to improve scalability,
reliability, and manageability [29]. In those systems,
however, because the unwritten assumption between file
blocks and their corresponding LBAs is broken, the FTL
receives a new LBA write request for every update of a file
block and thus cannot exploit any file level hotness
semantics for random write optimization.

In summary, the LBA-based interface between the no- 
overwrite file systems and storage devices does not al- 
low the file blocks’ hotness semantic to flow down to 
the storage layer. In addition, the relatively poor random 
write performance in SSDs is the source of suboptimal 
performance in the in-place-update file systems. Conse- 
quently, we suggest that file systems should directly ex- 
ploit the hotness statistics at the file block level. This al- 
lows for optimization of the file system performance re- 
gardless of whether the unwritten assumption holds and 
how the underlying SSDs perform on random writes. 

Write bandwidth maximization by sequentialized 
writes to SSD: In Section 2.2, we show that the ran- 
dom write throughput becomes equal to the sequential 
write throughput only when the request size is a multiple 
of the clustered block size. To exploit such performance 
characteristics, SFS takes a log-structured approach that
turns random writes at the file level into sequential writes 
at the LBA level. Moreover, in order to utilize nearly 
100% of the raw SSD bandwidth, the segment size is set 
to a multiple of the clustered block size. The result is that 
the performance of SFS will be limited by the maximum 
sequential write performance regardless of random write 
performance. 

Eager on writing data grouping for better bimodal 
segmentation: When there are not enough free segments,
a segment cleaner copies the live blocks from victim
segments in order to secure free segments. Since segment
cleaning includes reads and writes of live blocks, it
is the main source of overhead in any log-structured file 
system. Segment cleaning cost becomes especially high 
when cold data are mixed with hot data in the same seg- 
ment. Since cold data are updated less frequently, they 
are highly likely to remain live at the segment clean- 
ing and thus be migrated to new segments. If hot data 
and cold data are grouped into different segments, most 
blocks in the hot segment will be quickly invalidated, 
while most blocks in the cold segment will remain live. 
As a result, the segment utilization distribution becomes
bimodal: most of the segments are either almost full or
almost empty of live blocks. The cleaning overhead is
drastically reduced, because the segment cleaner can almost
always work with nearly empty segments. To form a
bimodal distribution, LFS uses a cost-benefit policy [32]
that prefers cold segments over hot segments. However, 
LFS writes data regardless of hot/cold and then tries to 
separate data lazily on segment cleaning. If we can cate- 
gorize hot/cold data when it is first written, there is much 
room for improvement. 

In SFS, we classify data on writing, as well as on segment
cleaning, based on file block level statistics. In such early
data grouping, since segments are already composed 
of homogeneous data with similar update likelihood, 
we can significantly reduce segment cleaning overhead. 
This is particularly effective because I/O skewness is 
common in real world workloads, as shown in Sec- 
tion 2.3. 


3.2 SFS Architecture


SFS has four core operations: segment writing, segment 
cleaning, reading, and crash recovery. Segment writing 
and segment cleaning are particularly crucial for performance
optimization in SFS, as depicted in Figure 3. Because
the read operation in SFS is the same as that of existing
log-structured file systems, we will not cover its
detail in this paper.

As a measure for representing the future update likelihood
of data in SFS, we define hotness for file block,
file, and segment, respectively. The higher the hotness,
the sooner the data is expected to be updated. The first step
of segment writing in SFS is to determine the hotness 
criteria for block grouping. This is, in turn, determined 
by segment quantization that quantizes a range of hot- 
ness values into a single hotness value for a group. For 
the sake of brevity, it is assumed throughout this paper 
that there are four segment groups: hot, warm, cold, and 
read-only groups. The second step of segment writing is 
to calculate the block hotness for each block and assign 
them to the nearest quantized group by comparing the 
block hotness and the group hotness. At this point, those 
blocks with similar hotness levels should belong to the 
same group (i.e. their future update likelihood is similar).
As the final step of segment writing, SFS always
fills a segment with blocks belonging to the same group. 
If the number of blocks in a group is not enough to fill 
a segment, the segment writing of the group is deferred 
until the segment is filled. This eager grouping of file 
blocks according to the hotness measure serves to colo- 
cate blocks with similar update likelihoods in the same 
segment. Therefore, segment writing in SFS is very effective
at achieving sharper bimodality in segment utilization
distribution.

Segment cleaning in SFS consists of three steps: se- 
lect victim segments, read the live blocks in victim seg- 
ments into the page cache and mark the live blocks as 
dirty, and trigger the writing process. The writing pro- 
cess treats the live blocks from victim segments the same 
as normal blocks; each live block is classified into a spe- 
cific quantized group according to its hotness. After all 
the live blocks are read into the page cache, the victim 
segments are then marked as free so that they can be 
reused for writing. For better victim segment selection, 
a cost-hotness policy is introduced, which takes into account
both the number of live blocks in a segment (i.e.
cost) and the segment hotness.

In the following sections, we will explain each component
of SFS in detail: how to measure hotness (§ 3.3),
segment quantization (§ 3.4), segment writing (§ 3.5),
segment cleaning (§ 3.6), and crash recovery (§ 3.7).


3.3 Measuring Hotness


In SFS, hotness is used as a measure of how likely the
data is to be updated. Hotness is defined for file block,
file, and segment, respectively. Although it is difficult
to estimate data hotness without prior knowledge of the
future access pattern, SFS exploits both the skewness and
the temporal locality in the I/O workload so as to estimate
the update likelihood of data. From the skewness
observed in many workloads, frequently updated data
tends to be updated quickly. Moreover, because of the
temporal locality in references, the recently updated data
is likely to be changed quickly. Thus, using the skewness
and the temporal locality, hotness is defined as
$\frac{\text{write count}}{\text{age}}$.
The hotness of a file block, a file, and a segment is
specifically defined as follows.

First, block hotness $H_b$ is defined by age and write
count of a block as follows:

\[
H_b =
\begin{cases}
\dfrac{W_b}{T - T_b^m} & \text{if } W_b > 0, \\
H_f & \text{otherwise,}
\end{cases}
\]

where $T$ is the current time, $T_b^m$ is the last modified time
of the block, and $W_b$ is the total number of writes on the
block since the block was created. If a block is newly
created ($W_b = 0$), $H_b$ is defined as the hotness of the
file that the block belongs to.

Next, file hotness $H_f$ is used to estimate the hotness
of a newly created block. It is defined by age and write
count of a file as follows:

\[
H_f = \frac{W_f}{T - T_f^m},
\]

where $T_f^m$ is the last modified time of the file, and $W_f$
is the total number of block updates since the file was
created.
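
The block and file hotness definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SFS implementation: the function names, the use of plain numeric timestamps, and the handling of a zero age (which the text does not specify) are all hypothetical.

```python
def file_hotness(now, last_modified, write_count):
    """H_f = W_f / (T - T_f^m).

    Assumption: a file with zero (or negative) age gets hotness 0.0,
    since the text does not define this corner case.
    """
    age = now - last_modified
    if age <= 0:
        return 0.0
    return write_count / age


def block_hotness(now, last_modified, write_count, parent_file_hotness):
    """H_b = W_b / (T - T_b^m) if W_b > 0, otherwise the file hotness H_f."""
    if write_count == 0:
        # Newly created block: inherit the hotness of the file it belongs to.
        return parent_file_hotness
    age = now - last_modified
    if age <= 0:
        # Assumption: with no measurable age, fall back to the file hotness
        # instead of dividing by zero.
        return parent_file_hotness
    return write_count / age
```

For example, a block written 5 times and last modified 10 time units ago has hotness 0.5, while a block that has never been written takes whatever hotness its file currently has.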

Finally, segment hotness represents how likely a segment
is to be updated. Since a segment is a collection
of blocks, it is reasonable to derive its hotness from the
hotness of live blocks contained within. That is, the higher
the hotness of live blocks in a segment, the higher the
segment hotness. Therefore, we define
hotness of a segment $H_s$ as the average hotness of the
live blocks in the segment. However, it is expensive to
calculate $H_s$ because the liveness of all blocks in a segment
must be tested. To determine $H_s$ for all segments
in a disk, the liveness of all blocks in the disk must be
checked. To alleviate this cost, we approximately calculate
the average hotness of live blocks in a segment as
follows:

\[
H_s = \frac{1}{N} \sum_{i} H_{b_i}
\approx \frac{\text{mean of write count of live blocks}}{\text{mean of age of live blocks}}
= \frac{\sum_{i} W_{b_i}}{N \cdot T - \sum_{i} T_{b_i}^m},
\]

where $N$ is the number of live blocks in a segment, and
$H_{b_i}$, $T_{b_i}^m$, and $W_{b_i}$ are the block hotness, last modified time,
and write count of the $i$-th live block, respectively. When
a segment is created, SFS stores $\sum_i T_{b_i}^m$ and $\sum_i W_{b_i}$
in the segment usage meta-data file (SUFILE), and updates
them by subtracting $T_{b_i}^m$ and $W_{b_i}$ whenever a block
is invalidated. Using this approximation, we can incrementally
calculate $H_s$ of a segment without checking the
liveness of blocks in the segment. We will elaborate on
how to manage meta-data for hotness in Section 4.1.

Figure 4: Example of segment quantization.
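
The incremental bookkeeping behind the approximate segment hotness can be illustrated as follows. This is a minimal sketch: the class and method names are hypothetical, and it assumes the number of live blocks is tracked alongside the two running sums, which the text implies but does not spell out.

```python
class SegmentHotness:
    """Running sums needed to approximate H_s without scanning blocks."""

    def __init__(self):
        self.live_blocks = 0   # N
        self.sum_mtime = 0     # sum of T_b^m over live blocks
        self.sum_writes = 0    # sum of W_b over live blocks

    def add_block(self, mtime, writes):
        # Called when a block is written into this segment.
        self.live_blocks += 1
        self.sum_mtime += mtime
        self.sum_writes += writes

    def invalidate_block(self, mtime, writes):
        # Called when a live block in this segment is overwritten elsewhere.
        self.live_blocks -= 1
        self.sum_mtime -= mtime
        self.sum_writes -= writes

    def hotness(self, now):
        # H_s ~= sum(W_b) / (N * T - sum(T_b^m))
        total_age = self.live_blocks * now - self.sum_mtime
        if self.live_blocks == 0 or total_age <= 0:
            return 0.0  # assumption: empty or zero-age segments count as cold
        return self.sum_writes / total_age
```

Both updates are constant-time, so the hotness of every segment can be kept current without testing block liveness.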


3.4 Segment Quantization 


In order to minimize the overhead of copying the live 
blocks during segment cleaning, it is crucial for SFS to 
properly group blocks according to hotness and then to 
write them in grouped segments. The effectiveness of 
block grouping is determined by the grouping criteria. 
In fact, improper criteria may colocate blocks from dif- 
ferent groups into the same segment, thus deteriorating 
the effectiveness of grouping. Ideally, grouping criteria 
should consider the distribution of all blocks’ hotness 
in the file system, yet in reality this is too costly. Thus, 
we instead use segment hotness as an approximation of 
block hotness and devise an algorithm to calculate the 
criterion, iterative segment quantization. 

In SFS, segment quantization is a process used to partition
the hotness range of a file system into k sub-ranges
and calculate a quantized value for each sub-range
representing a group. There are many alternative ways to
quantize hotness. For example, each group can be quantized
using equi-width partitioning or equi-height
partitioning. Equi-width partitioning equally divides the
whole hotness range into multiple groups and equi-height
partitioning makes each group have an equal number of
segments. In Figure 4, the segment hotness distribution
is computed by measuring the hotness for all segments 
on the disk after running TPC-C workload under 70% 
disk utilization. In such a distribution where most seg- 
ments are not hot, however, both approaches fail to cor- 
rectly reflect the hotness distribution and the resulting 
group quantization is suboptimal. 
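
To see why neither simple partitioning fits a skewed distribution, the two alternatives described above (one splitting the hotness range evenly, one giving every group the same number of segments) can be sketched as below. The function names and the sample values are illustrative assumptions, not data from the paper.

```python
def equal_range_bounds(values, k):
    """Boundaries that split [min, max] into k sub-ranges of equal size."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]


def equal_count_bounds(values, k):
    """Boundaries that give each of the k groups about the same number of items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // k] for i in range(1, k)]
```

With hypothetical segment hotness values `[1, 1, 1, 1, 2, 2, 3, 100]` and two groups, the equal-range boundary falls at 50.5, so seven of the eight segments land in one group, while the equal-count boundary falls at 2 and separates segments whose hotness is nearly identical.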

In order to correctly reflect the hotness distribution of 
segments and to properly quantize them, we propose an 
iterative segment quantization algorithm. Inspired by the 
data clustering approach in the statistics domain [15], our
iterative segment quantization partitions segments into
k groups and tries to find the centers of natural groups
through an iterative refinement approach. A detailed
description of the algorithm is as follows:


1. If the number of written segments is less than or
equal to k, assign a randomly selected segment hotness
to the initial value of $H_{g_i}$, which denotes the hotness
of the $i$-th group.

2. Otherwise, update $H_{g_i}$ as follows:

(a) Assign each segment to the group $G_i$ whose
hotness is closest to the segment hotness.

\[
G_i = \{ H_{s_j} : \| H_{s_j} - H_{g_i} \| \le \| H_{s_j} - H_{g_{i^*}} \| \text{ for all } i^* = 1, \ldots, k \}
\]

(b) Calculate the new means to be the group hotness $H_{g_i}$.

\[
H_{g_i} = \frac{1}{|G_i|} \sum_{H_{s_j} \in G_i} H_{s_j}
\]

3. Repeat Step 2 until $H_{g_i}$ no longer changes or three
times at most.


Although its computational overhead increases in
proportion to the number of segments, the
large segment size keeps the overhead of the proposed
algorithm reasonable (32 segments for 1 GB
disk space given 32 MB segment size). To further reduce
the overhead, SFS stores $H_{g_i}$ in meta-data and reloads
them at mounting for faster convergence.
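
The three steps of iterative segment quantization amount to a bounded, one-dimensional clustering pass. The sketch below follows the steps as listed; the function name, the `rng` parameter, and the choice to leave an empty group's hotness unchanged are assumptions made for illustration.

```python
import random


def quantize_segments(segment_hotness, group_hotness, max_iters=3, rng=None):
    """Return updated group hotness values (one per group)."""
    k = len(group_hotness)
    if not segment_hotness:
        return list(group_hotness)
    if len(segment_hotness) <= k:
        # Step 1: too few segments, so seed each group with a randomly
        # selected segment hotness.
        rng = rng or random.Random()
        return [rng.choice(segment_hotness) for _ in range(k)]

    centers = list(group_hotness)
    for _ in range(max_iters):  # Step 3: at most three rounds by default
        # Step 2(a): assign each segment to the group with the closest hotness.
        groups = [[] for _ in range(k)]
        for h in segment_hotness:
            nearest = min(range(k), key=lambda i: abs(h - centers[i]))
            groups[nearest].append(h)
        # Step 2(b): the mean of each group becomes its new hotness.
        # Assumption: an empty group keeps its previous value.
        new_centers = [
            sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break  # Step 3: stop once the group hotness no longer changes
        centers = new_centers
    return centers
```

Passing in the group hotness saved at the previous mount as `group_hotness` corresponds to the reloading described above and lets the loop start close to convergence.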


3.5 Segment Writing 


As illustrated in Figure 3, segment writing in SFS con- 
sists of two sequential steps: one to group dirty blocks in 
the page cache and the other to write the blocks group- 
wise in segments. Segment writing is invoked in four 
cases: (a) SFS periodically writes dirty blocks every five 
seconds, (b) the flush daemon forces a reduction in the number
of dirty pages in the page cache, (c) segment cleaning
occurs, and (d) an fsync or sync occurs. The first
step of segment writing is segment quantization: all $H_{g_i}$
are updated as described in Section 3.4. Next, the block
hotness $H_b$ of every dirty block is calculated, and each
block is assigned to the group whose hotness $H_{g_i}$ is closest
to the block hotness.

To avoid blocks in different groups being colocated in 
the same segment, SFS completely fills a segment with 
blocks from the same group. In other words, among all 
groups, only the groups large enough to completely fill a 
segment are written. Thus, when the group size, i.e. the
number of blocks belonging to a group, is less than the 
segment size, SFS will defer writing the blocks to the 
segment until the group size reaches the segment size. 
However, when an fsync or sync occurs or SFS initiates
a check-point, every dirty block including the deferred
blocks should be immediately written to segments regardless
of the group size. In this case, we take a best-effort
approach: first, writing out as many blocks groupwise
as possible, then writing the remaining blocks
regardless of group. In all cases, writing a block is
accompanied by updating the relevant meta-data, $T_b^m$, $W_b$, $T_f^m$, $W_f$,
$\sum_i T_{b_i}^m$, and $\sum_i W_{b_i}$, and invalidating the liveness of
the overwritten block. Since the writing process continuously
reorganizes file blocks according to hotness, it
helps to form a sharp bimodal distribution of segment
utilization, and thus to reduce the segment cleaning
overhead. Further, it almost always generates aligned large
sequential write requests that are optimal for SSD.
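
The grouping and deferral rules of segment writing can be sketched as a planning function. This is a simplified illustration: the function name, the `(block_id, hotness)` input format, and the `force` flag standing in for the fsync, sync, and check-point cases are assumptions, and meta-data updates are omitted.

```python
def plan_segment_writes(dirty_blocks, group_hotness, blocks_per_segment,
                        force=False):
    """Return (segments_to_write, deferred_blocks).

    dirty_blocks is a list of (block_id, hotness) pairs.
    """
    # Assign every dirty block to the group with the closest hotness.
    groups = [[] for _ in group_hotness]
    for block_id, h in dirty_blocks:
        nearest = min(range(len(group_hotness)),
                      key=lambda i: abs(h - group_hotness[i]))
        groups[nearest].append(block_id)

    segments, leftovers = [], []
    for members in groups:
        # Only completely filled segments are written groupwise.
        full = len(members) - len(members) % blocks_per_segment
        for start in range(0, full, blocks_per_segment):
            segments.append(members[start:start + blocks_per_segment])
        leftovers.extend(members[full:])

    if not force:
        # Normal case: blocks that cannot fill a segment are deferred.
        return segments, leftovers

    # Best effort on fsync/sync/check-point: write what remains,
    # regardless of group.
    for start in range(0, len(leftovers), blocks_per_segment):
        segments.append(leftovers[start:start + blocks_per_segment])
    return segments, []
```

In the normal path every emitted segment is homogeneous; only the forced path can mix groups, and only for the leftover blocks.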

Because the blocks under segment cleaning are han- 
dled similarly, their writing can also be deferred if the 
number of live blocks belonging to a group is not enough 
to completely fill a segment. As such, there is a danger 
that the not-yet-written blocks under segment cleaning 
might be lost if the originating segments of the blocks 
are already overwritten by new data when a system crash
or a sudden power off is encountered. To cope with such 
data loss, two techniques are introduced. First, SFS man- 
ages a free segment list and allocates segments in the 
least recently freed (LRF) order. Second, SFS checks 
whether writing a normal block could cause a not-yet- 
written block under segment cleaning to be overwritten. 
Let $S^t$ denote a newly allocated segment and $S^{t+1}$
denote a segment that will be allocated in the next segment
allocation. If there are not-yet-written blocks under segment
cleaning that originate in $S^{t+1}$, SFS writes such
blocks to $S^t$ regardless of grouping. This guarantees
that not-yet-written blocks under segment cleaning are 
never overwritten before they are written elsewhere. The 
segment-cleaned blocks are thus never lost, even in a 
system crash or a sudden power off, because they al- 
ways have an on-disk copy. The LRF allocation scheme 
increases the opportunity for a segment-cleaned block 
to be written by block grouping rather than this scheme. 
The details of minimizing the overhead in this process 
are omitted from this paper. 
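
Although the paper omits the details, the two safeguards can be sketched with a queue of free segments. Everything named here (the `SegmentAllocator` class, `blocks_to_rescue`, and the `pending_cleaned` dictionary keyed by originating segment) is a hypothetical illustration rather than the SFS code.

```python
from collections import deque


class SegmentAllocator:
    """Free segment list served in least-recently-freed (LRF) order."""

    def __init__(self, free_segments=()):
        self.free = deque(free_segments)  # oldest-freed segment on the left

    def release(self, segment):
        self.free.append(segment)

    def allocate(self):
        return self.free.popleft()

    def peek_next(self):
        # The segment that the next allocation would return (S^{t+1}).
        return self.free[0] if self.free else None


def blocks_to_rescue(allocator, pending_cleaned):
    """Call right after allocating S^t.

    Returns the not-yet-written cleaned blocks that originate in S^{t+1};
    these must go into S^t now, regardless of grouping, because S^{t+1}
    is about to be overwritten.
    """
    return pending_cleaned.pop(allocator.peek_next(), [])
```

Because a freed segment goes to the back of the queue, its deferred blocks usually get written through normal grouping long before the segment reaches the front, so the rescue path should rarely trigger.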


3.6 Segment Cleaning: Cost-hotness policy 


In any log-structured file system, the victim selection 
policy is critical to minimizing the overhead of segment 
cleaning. There are two well-known segment clean- 
ing policies: greedy policy [32] and cost-benefit policy 
[32, 17]. Greedy policy [32] always selects segments 
with the smallest number of live blocks, hoping to re- 
claim as much space as possible with the least copying 
out overhead. However, it does not consider the hotness 
of data blocks during segment cleaning. In practice, be- 
cause the cold data tends to remain unchanged for a long 
time before it becomes invalidated, it would be very ben- 
eficial to separate cold data from hot data. To this end, 
cost-benefit policy [32, 17] prefers cold segments to hot 
segments when the number of live blocks is equal. Even 
though it is critical to estimate how long a segment remains
unchanged, cost-benefit policy simply uses the
last modified time of any block in the segment (i.e. the 
age of the youngest block) as a simple measure of the 
segment’s update likelihood. 

As a natural extension of cost-benefit policy, we introduce
the cost-hotness policy; since hotness in SFS directly
represents the update likelihood of a segment, we use segment
hotness instead of segment age. Thus, SFS chooses
as a victim the segment that maximizes the following
formula:

\[
\text{cost-hotness}
= \frac{\text{free space generated}}{\text{cost} \times \text{segment hotness}}
= \frac{1 - U_s}{2 U_s H_s},
\]

where $U_s$ is segment utilization, i.e. the fraction of the
live blocks in a segment. The cost of collecting a segment
is $2U_s$ (one $U_s$ to read valid blocks and the other
$U_s$ to write them back). Although cost-hotness policy
needs to access the utilization and the hotness of all segments,
it is very efficient because our implementation
keeps them in segment usage meta-data file (SUFILE)
and meta-data size per segment is quite small (48 bytes
long). All segment usage information is very likely to be
cached in memory and can be accessed without accessing
the disk in most cases. We will describe the detail of
meta-data management in Section 4.1.
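
Victim selection under the cost-hotness policy reduces to ranking segments by the formula above. The sketch below assumes the per-segment utilization and hotness are already available in memory; the function names and the treatment of the two zero-denominator cases are assumptions, since the text does not define them.

```python
def cost_hotness(utilization, hotness):
    """(1 - U_s) / (2 * U_s * H_s) for one segment."""
    if utilization == 0 or hotness == 0:
        # Assumption: a segment with no live blocks, or one that is never
        # expected to change, is treated as an ideal victim.
        return float("inf")
    return (1 - utilization) / (2 * utilization * hotness)


def select_victims(segments, count=3):
    """Pick up to `count` segment ids with the highest cost-hotness value.

    segments maps a segment id to a (U_s, H_s) pair.
    """
    ranked = sorted(segments,
                    key=lambda seg: cost_hotness(*segments[seg]),
                    reverse=True)
    return ranked[:count]
```

Two segments with the same utilization are ordered by hotness alone, so the colder one is cleaned first, which mirrors the preference of the cost-benefit policy for cold segments.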

In SFS, the segment cleaner is invoked when the disk 
utilization exceeds a water-mark. The water-mark for 
our experiments is set to 95% of the disk capacity
and the segment cleaning is allowed to process up to 
three segments at once (96 MB given the segment size of 
32 MB). The prototype did not implement the idle time 
cleaning scheme suggested by Blackwell et al. [7], yet 
this could be seamlessly integrated with SFS. 


3.7 Crash Recovery 


Upon a system crash or a sudden power off, in-progress
write operations may leave the file system inconsistent.
This is because dirty data blocks or meta-data
blocks in the page cache may not have been safely written
to the disk. In order to recover from such inconsistencies,
SFS uses a check-point mechanism; on re-mounting
after a crash, the file system is rolled back to
the last check-point state, and then resumes in a normal 
manner. A check-point is the state in which all of the file 
system structures are consistent and complete. In SFS, a 
check-point is carried out in two phases; first, it writes 
out all the dirty data and meta-data to the disk, and then 
updates the superblock in a special fixed location on the 
disk. The superblock keeps the root address of the meta-data,
the position in the last written segment, and a time-stamp.
SFS can guarantee the atomic write of the superblock
by alternating between writing it to two separate
physical blocks on the disk. During re-mounting,
SFS reads both copies of the superblock, compares their 
time stamps and uses the more recent one. 
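
The alternating superblock scheme can be sketched as follows. The field names mirror the contents listed above, but the `Superblock` class, the `valid` flag (standing in for whatever integrity check detects a torn write), and both function names are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Superblock:
    timestamp: int
    root_address: int
    last_segment_position: int
    valid: bool = True  # assumption: result of some integrity check


def next_superblock_slot(last_written_slot):
    """Alternate between the two fixed on-disk locations (0 and 1)."""
    return 1 - last_written_slot


def pick_superblock(copy_a: Optional[Superblock],
                    copy_b: Optional[Superblock]) -> Optional[Superblock]:
    """Choose the more recent usable copy at re-mount time."""
    candidates = [sb for sb in (copy_a, copy_b) if sb is not None and sb.valid]
    if not candidates:
        return None
    return max(candidates, key=lambda sb: sb.timestamp)
```

Since only one copy is being rewritten at any moment, a crash during the update leaves the other, older copy intact, and recovery rolls back to that check-point.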

Frequent check-pointing can minimize data loss from 
crashes but can hinder normal system performance. Con- 
sidering this trade-off, SFS performs a check-point in 
four cases: (a) every thirty seconds after creating a 
check-point, (b) when more than 20 segments (640 MB 
given a segment size of 32 MB) are written, (c) when 
performing sync or fsync operation, and (d) when the file 
system is unmounted. 
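
The four check-point triggers can be written as a single predicate. The function and parameter names are hypothetical; the default thresholds are the values stated above.

```python
def should_checkpoint(seconds_since_last, segments_since_last,
                      sync_requested=False, unmounting=False,
                      interval_seconds=30, segment_limit=20):
    """Return True if any of the four check-point conditions holds."""
    return (
        seconds_since_last >= interval_seconds    # (a) thirty seconds elapsed
        or segments_since_last > segment_limit    # (b) more than 20 segments
        or sync_requested                         # (c) sync or fsync
        or unmounting                             # (d) unmount
    )
```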


4 Evaluation 


4.1 Experimental Systems 


Implementation: SFS is implemented based on 
NILFS2 [28] by retrofitting the in-memory and on- 
disk meta-data structures to support block grouping and 
cost-hotness segment cleaning. NILFS2 in the mainline 
Linux kernel is based on log-structured file system [32] 
and incorporates advanced features such as b-tree based 
block management for scalability and continuous snap- 
shot [20] for ease of management. 

Implementing SFS requires a significant engineering 
effort, despite the fact that it is based on the already ex- 
isting NILFS2. NILFS2 uses b-tree for scalable block 
mapping and virtual-to-physical block translation in data 
address translation (DAT) meta-data file to support con- 
tinuous snapshot. One technical issue of b-tree based 
block mapping is the excessive meta-data update over- 
head. If a leaf block in a b-tree is updated, its effect is 
always propagated up to the root node and all the corre- 
sponding virtual-to-physical entries in the DAT are also 
updated. Consequently, random writes entail a signifi- 
cant amount of meta-data updates — writing 3.2 GB 
with 4 KB I/O unit generates 3.5 GB of meta-data. To 
reduce this meta-data update overhead and support the 
check-point creation policy discussed in Section 3.7, we 
decided to cut off the continuous snapshot feature. In- 
stead, SFS-specific fields are added to several meta-data 
structures: superblock, inode file (IFILE), segment usage
file (SUFILE), and DAT file. Group hotness $H_{g_i}$ is
stored in the superblock and loaded at mounting for the
iterative segment quantization. Per-file write count $W_f$
and the last modified time $T_f^m$ are stored in the IFILE.
The SUFILE contains information for hotness calculation
and segment cleaning: $U_s$, $H_s$, $\sum_i T_{b_i}^m$, and $\sum_i W_{b_i}$.
Per-block write count $W_b$ and the last modified time
$T_b^m$ are stored in the DAT entry along with the virtual-to-physical
mapping. Of these, $W_b$ and $T_b^m$ are the largest,
each being eight bytes long. Since the meta-data fields
for continuous snapshot in the DAT entry have been removed,
the size of the DAT entry in SFS is the same as
that of NILFS2 (32 bytes). As a result of these changes,
we reduce the runtime overhead of meta-data to 5%-10%
for the workloads used in our experiments. In SFS,
since a meta-data file is treated the same as a normal file 
with a special inode number, a meta-data file can also be 
cached in the page cache for efficient access. 

Segment cleaning in NILFS2 has not kept up with the
state of the art in academia. It uses a simple time-stamp
policy [28] that selects the oldest dirty segment as a victim.
For SFS, we implemented the cost-hotness policy
and segment cleaning triggering policy described in Sec- 
tion 3.6. 

In our implementation, we used Linux kernel 2.6.37, 
and all experiments are performed on a PC using a 2.67 
GHz Intel Core i5 quad-core processor with 4 GB of
physical memory. 

Target SSDs: Currently, the spectrum of SSDs avail- 
able in the market is very wide in terms of price and per- 
formance; flash memory chips, RAM buffers, and hard- 
ware controllers all vary greatly. For this paper, we se- 
lect three state-of-the-art SSDs as shown in Table 1. The 
high-end SSD is based on SLC flash memory and the 
rest are based on MLC. Hereafter, these three SSDs are 
referred to as SSD-H, SSD-M, and SSD-L ranging from 
high-end to low-end. 

Figure 1 shows sequential vs. random write through- 
put of the three devices. The request sizes of random 
write whose bandwidth converges to that of sequential 
write are 16 MB, 32 MB, and 16 MB for SSD-H, SSD- 
M, and SSD-L, respectively. To fully exploit device per- 
formance, the segment size is set to 32 MB for all three 
devices. 

Workloads: To study the impact of SFS on various 
workloads, we use a mixture of synthetic and real-world 
workloads. Two real-world file system traces are used 
in our experiments: OLTP database workload, and desk- 
top workload. For OLTP database workload, the file sys- 
tem level trace is collected while running TPC-C [40]. 
The database server runs Oracle 11g DBMS and the 
load server runs Benchmark Factory [30] using TPC- 
C benchmark scenario. For desktop workload, we used 
RES from the University of California at Berkeley [31]. 
RES is a research workload collected for 113 days on a 
system consisting of 13 desktop machines of a research 
group. In addition, two traces of random writes with 
different distributions are generated as synthetic work- 
loads: one with Zipfian distribution and the other with 
uniform random distribution. The uniform random write 
is the workload that shows the worst case behavior of 
SFS, since SFS tries to utilize the skewness in workloads
during block grouping. 

Since our main area of interest is in maximum write
performance, write requests in the workloads are replayed
as fast as possible in a single thread and throughput
is measured at the application level. Native Command
Queuing (NCQ) is enabled to maximize the parallelism
in the SSD. In order to explore the system behavior
on various disk utilizations, we sequentially filled
the SSD with enough dummy blocks, which are never
updated after creation, until the desired utilization is
reached. Since the amount of the data block update is
the same for a workload regardless of the disk utilization,
the amount of the meta-data update is also the same.
Therefore, in our experiment results, we can directly
compare performance metrics for a workload regardless
of the disk utilization.

Figure 5: Write cost vs. number of groups. Disk utilization
is 85%.

Write Cost: To write new data in SFS, a new segment
is generated by the segment cleaner. This cleaning
process incurs additional read and write operations
for the live blocks being segment-cleaned. Therefore, the
write cost of data should include the implicit I/O cost of
segment cleaning as well as the pure write cost of new
data. In this paper, we define the write cost $W_c$ to
compare the write cost induced by segment cleaning. It
is defined by three component costs (the write cost of
new data, $W_c^{new}$, and the read and write costs of the
data being segment-cleaned, $R_c^{sc}$ and $W_c^{sc}$) as
follows:

$$W_c = \frac{W_c^{new} + R_c^{sc} + W_c^{sc}}{W_c^{new}}$$

Each component cost is defined as the amount of I/O
divided by the throughput. Since the unit of write
in SFS is always a large sequential chunk, we choose
the maximum sequential write bandwidth in Table 1 as
the throughput for $W_c^{new}$ and $W_c^{sc}$. Meanwhile,
since the live blocks being segment-cleaned are assumed
to be randomly located in a victim segment, the 4 KB
random read bandwidth in Table 1 is selected as the read
throughput for $R_c^{sc}$. Throughout this paper, we measured
the amount of I/O while replaying the workload trace
and then calculated the write cost for a workload.
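The write cost defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are ours, and the bandwidth and I/O figures in the example are hypothetical placeholders, not the values from Table 1 or from any experiment in the paper.

```python
def io_time(amount_bytes, bandwidth_bytes_per_s):
    """Component cost: amount of I/O divided by throughput (in seconds)."""
    return amount_bytes / bandwidth_bytes_per_s


def write_cost(new_bytes, cleaned_read_bytes, cleaned_write_bytes,
               seq_write_bw, rand_read_bw):
    """Write cost: (new write + cleaning read + cleaning write) / new write.

    New data and cleaned live blocks are written as large sequential
    chunks, so both use the sequential write bandwidth. Live blocks are
    read from random positions, so reads use the random read bandwidth.
    """
    w_new = io_time(new_bytes, seq_write_bw)
    r_sc = io_time(cleaned_read_bytes, rand_read_bw)
    w_sc = io_time(cleaned_write_bytes, seq_write_bw)
    return (w_new + r_sc + w_sc) / w_new


MB = 1024 * 1024

# Hypothetical device: 200 MB/s sequential write, 50 MB/s 4 KB random read.
SEQ_WRITE_BW = 200 * MB
RAND_READ_BW = 50 * MB

# Without any cleaning traffic the write cost is exactly 1.
cost_no_cleaning = write_cost(1024 * MB, 0, 0, SEQ_WRITE_BW, RAND_READ_BW)

# Cleaning that moves 512 MB of live blocks while 1 GB of new data is written.
cost_with_cleaning = write_cost(1024 * MB, 512 * MB, 512 * MB,
                                SEQ_WRITE_BW, RAND_READ_BW)
```

A write cost of 1 means that segment cleaning added no I/O time at all; larger values measure how much cleaning inflates the time spent per unit of new data.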


4.2 Effectiveness of SFS Techniques 

As discussed in Section 3, the key techniques of SFS
are (a) on-writing block grouping, (b) iterative segment
quantization, and (c) cost-hotness segment cleaning. To
examine how each technique contributes to the overall
performance, we measured the write costs of the Zipf and
TPC-C workloads under 85% disk utilization on SSD-M.

Figure 6: Write costs of quantization schemes. Disk
utilization is 85%.

Figure 7: Write cost vs. segment cleaning scheme. Disk
utilization is 85%.

First, to verify how effective block grouping is,
we measured the write costs while varying the number of
groups from one to six. As shown in Figure 5, the effect
of block grouping is considerable. When the blocks are
not grouped (i.e., the number of groups is 1), the write
cost is fairly high: 6.96 for Zipf and 5.98 for TPC-C.
Even when the number of groups increases to two or three,
no significant reduction in write cost is observed.
However, when the number of groups reaches four, the
write costs of the Zipf and TPC-C workloads drop
significantly to 4.21 and 2.64, respectively. With five
or more groups, the write cost reduction is marginal.
Additional groups do not help much when there are
already enough groups to cover the hotness distribution,
and may in fact increase the write cost: more blocks can
be deferred due to insufficient blocks in a group, which
could result in more blocks being written without
grouping when creating a checkpoint.

Next, we compared the write costs of the different
segment quantization schemes for four groups. Figure 6
shows that our iterative segment quantization reduces
the write costs significantly. The equi-width
partitioning scheme has the highest write cost: 143% for
Zipf and 192% for TPC-C over the iterative segment
quantization. The write costs of the equi-height
partitioning scheme are 115% for Zipf and 135% for TPC-C
over the iterative segment quantization.
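To make the two baseline schemes concrete, the following Python sketch shows the textbook forms of equi-width and equi-height partitioning applied to a list of per-segment hotness values. It is an illustration only: the function names and the sample hotness values are hypothetical, and it does not reproduce the iterative segment quantization algorithm, which is defined in Section 3.

```python
def equi_width_bounds(values, k):
    """Split the value range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]


def equi_height_bounds(values, k):
    """Pick boundaries so that each group holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // k] for i in range(1, k)]


def assign_group(value, bounds):
    """Group index = number of boundaries that the value reaches or exceeds."""
    return sum(value >= b for b in bounds)


def group_sizes(values, bounds):
    """Count how many values fall into each group."""
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[assign_group(v, bounds)] += 1
    return counts


# Hypothetical, heavily skewed hotness values (one very hot segment).
hotness = [1, 2, 3, 4, 5, 6, 7, 100]

width_sizes = group_sizes(hotness, equi_width_bounds(hotness, 4))
height_sizes = group_sizes(hotness, equi_height_bounds(hotness, 4))
```

With skewed input, equi-width partitioning places almost every segment in one group and leaves others empty, while equi-height partitioning balances group sizes but may place segments of very different hotness in the same group.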

Finally, to verify how the cost-hotness policy affects
performance, we compared the write costs of the
cost-hotness policy and the cost-benefit policy, using
iterative segment quantization with four groups. As
shown in Figure 7, the cost-hotness policy reduces the
write cost by approximately 7% relative to the
cost-benefit policy for both the TPC-C and Zipf
workloads.


4.3 Performance Evaluation 


4.3.1 Write Cost and Throughput 


To show how SFS and LFS perform against various
workloads with different write patterns, we measured
their write costs and throughput for two synthetic
workloads and two real workloads, and present the
performance results in Figures 8 and 9. For LFS, we
implemented the cost-benefit cleaning policy in our code
base (hereafter LFS-CB). Since throughput is measured at
the application level, it includes the effects of the
page cache and thus can exceed the maximum throughput of
each device. Due to space constraints, only the
experiments on SSD-M are shown here. The performance of
SFS on different devices is shown in Section 4.3.3.

First, let us examine how much SFS can improve the
write cost. It is clear from Figure 8 that SFS
significantly reduces the write cost compared to LFS-CB.
In particular, the relative write cost improvement of
SFS over LFS-CB grows as disk utilization increases.
Since there is not enough time for the segment cleaner
to reorganize blocks under high disk utilization, our
on-writing data grouping shows greater effectiveness.
For the TPC-C workload, which has high update skewness,
SFS reduces the write cost by 77.4% under 90%
utilization. Although the uniform random workload, which
has no skewness, is a worst-case workload, SFS still
reduces the write cost by 27.9% under 90% utilization.
This shows that SFS can effectively reduce the write
cost for a variety of workloads.

To see if the lower write costs in SFS will result in 
higher performance, throughput is also compared. As 
Figure 9 shows, SFS improves throughput of the TPC-C 
workload by 151.9% and that of uniform random work- 
load by 18.5% under 90% utilization. It shows that the 
write cost reduction in SFS actually results in perfor- 
mance improvement. 


4.3.2 Segment Utilization Distribution 


To further study why SFS significantly outperforms
LFS-CB, we also compared the segment utilization
distributions of SFS and LFS-CB. Segment utilization is
calculated by dividing the number of live blocks in the
segment by the total number of blocks per segment.
After running a workload, the distribution is computed
by measuring the utilizations of all non-dummy segments
on the SSD. Since SFS continuously re-groups data blocks
according to hotness, it is likely that a sharp bimodal
distribution is formed. Figure 10 shows the segment
utilization distribution when disk utilization is 70%.
We can see an obvious bimodal segment distribution in
SFS for all workloads except the skewless uniform random
workload. Even in the uniform random workload, the
segment utilization of SFS is skewed toward lower
utilization. Under such a bimodal distribution, the
segment cleaner can select as victims those segments
with few live blocks. For example, as shown in Figure
10a, SFS will select a victim segment with 10%
utilization, while LFS-CB will select a victim segment
with 30% utilization. In this case, the number of live
blocks in a victim in SFS is just one-third of that in
LFS-CB, so the segment cleaner copies only one-third as
many blocks. The reduced cleaning overhead results in a
significant performance gap between SFS and LFS-CB.
This experiment shows that SFS forms a sharp bimodal
distribution of segment utilization through data block
grouping, and thereby reduces the write cost.
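The utilization measurement described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical segment size of 1,000 blocks and made-up live-block counts; the helper names are ours and the numbers are not taken from the experiments.

```python
def segment_utilization(live_blocks, blocks_per_segment):
    """Utilization = live blocks in the segment / total blocks per segment."""
    return live_blocks / blocks_per_segment


def utilization_histogram(live_counts, blocks_per_segment, bins=10):
    """Count segments per utilization bin (integer math avoids rounding issues)."""
    hist = [0] * bins
    for live in live_counts:
        idx = min(live * bins // blocks_per_segment, bins - 1)
        hist[idx] += 1
    return hist


BLOCKS_PER_SEGMENT = 1000  # hypothetical segment size

# Hypothetical bimodal state: segments are either nearly empty or nearly full.
live_counts = [0, 50, 100, 950, 1000]
hist = utilization_histogram(live_counts, BLOCKS_PER_SEGMENT)

# Live blocks the cleaner must copy for a 10% victim versus a 30% victim.
copied_10 = 100   # 10% of 1000 blocks
copied_30 = 300   # 30% of 1000 blocks
ratio = copied_10 / copied_30
```

In this example the histogram has mass only in the lowest and highest bins, and the 10% victim requires copying one-third as many live blocks as the 30% victim, which mirrors the comparison made for Figure 10a.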


4.3.3 Effects of SSD Performance 


In the previous sections, we showed that SFS can
significantly reduce the write cost and drastically
improve throughput on SSD-M. As shown in Section 2.2,
SSDs have various performance characteristics. To see
whether SFS can improve performance on various SSDs, we
compared the throughput of the same workloads on SSD-H,
SSD-M, and SSD-L in Figure 11. As shown in Table 1,
SSD-H is ten-fold more expensive than SSD-L, the maximum
sequential write performance of SSD-H is 4.5 times
faster than SSD-L, and the 4 KB random write performance
of SSD-H is more than 2,500 times faster than SSD-L.
Despite the fact that these three SSDs show such large
variances in performance and price, Figure 11 shows
that SFS performs well regardless of the random write
performance. The main limiting factor is the maximum
sequential write performance. This is because, except
for updating the superblock, SFS always generates large
sequential writes to fully exploit the maximum bandwidth
of the SSD. The experiment shows that SFS can provide
high performance even on mid-range or low-end SSDs, as
long as sequential write performance is high enough.


4.4 Comparison with Other File Systems 


Up to now, we have analyzed how SFS performs under
various environments with different workloads, disk
utilizations, and SSDs. In this section, we compare the
performance of SFS with that of three other file
systems, each with a different block update policy:
LFS-CB for logging policy, ext4 [25] for in-place-update
policy, and btrfs [10] for no-overwrite policy. To
enable btrfs’ SSD optimization, btrfs was mounted in SSD
mode. The in-place-update mode of btrfs was also tested
with the nodatacow option enabled to further analyze the
behavior of btrfs (hereafter btrfs-nodatacow). Four
workloads were run on SSD-M with 85% disk utilization.
To obtain the sustained performance, we measured 8 GB of
writing after 20 GB of writing for aging.

Figure 12: Throughput under different file systems.

First, we compared the throughput of the file systems
in Figure 12. SFS significantly outperforms LFS-CB,
ext4, btrfs, and btrfs-nodatacow for all four workloads.
The average throughput of SFS is higher than that of
each of the other file systems: 1.6 times for LFS-CB,
7.3 times for btrfs, 1.5 times for btrfs-nodatacow, and
1.5 times for ext4.

Next, we compared the write amplification, which
represents the garbage collection overhead inside the
SSD. We collected I/O traces issued by the file systems
using blktrace [8] while running the four workloads, and
the traces were run on an FTL simulator, which we
implemented, with two FTL schemes: (a) FAST [24] as a
representative hybrid FTL scheme and (b) page-level FTL
[17]. In both schemes, we configured a large-block 32 GB
NAND flash memory with 4 KB pages, 512 KB blocks, and
10% over-provisioned capacity. Figure 13 shows the write
amplification in FAST and page-level FTL for the four
workloads processed by each file system. In all cases,
the write amplification of the log-structured file
systems, SFS and LFS-CB, is very low: 1.1 in FAST and
1.0 in page-level FTL on average. This indicates that
both FTL schemes generate at most 10% additional
writes. Log-structured file systems collect random
writes at the file level and transform them into
sequential writes at the LBA level. This results in the
optimal switch merge [24] in FAST and creates large
chunks of contiguous invalid pages in page-level FTL. In
contrast, the in-place-update file systems, ext4 and
btrfs-nodatacow, have the largest write amplification:
5.3 in FAST and 2.8 in page-level FTL on average. Since
in-place-update file systems update a block in place,
random writes at the file level result in random writes
at the LBA level. This contributes to high write
amplification. Meanwhile, because btrfs never overwrites
a block and allocates a new block for every update, it
is likely to lower the average write amplification: 2.8
in FAST and 1.2 in page-level FTL on average.

Figure 13: Write amplification with different FTL schemes.

Figure 14: Number of erases with different FTL schemes.
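As a reading aid for these numbers, the sketch below uses the common definition of write amplification as the ratio of pages written to flash to pages written by the host. That definition, the helper names, and the page counts are assumptions made for illustration; the simulator used in the paper is not shown here. Only the 4 KB page and 512 KB block sizes come from the configuration above.

```python
PAGE_KB = 4
BLOCK_KB = 512
PAGES_PER_BLOCK = BLOCK_KB // PAGE_KB  # pages in one erase block


def write_amplification(host_pages_written, flash_pages_written):
    """Ratio of pages written to flash to pages requested by the host."""
    if host_pages_written <= 0:
        raise ValueError("host_pages_written must be positive")
    return flash_pages_written / host_pages_written


def extra_write_fraction(wa):
    """Fraction of additional writes caused by garbage collection."""
    return wa - 1.0


# Hypothetical counts: the host writes 1,000 pages and garbage collection
# copies 100 more valid pages, so 1,100 pages reach the flash.
wa = write_amplification(1000, 1100)
extra = extra_write_fraction(wa)
```

Under this definition a write amplification of 1.1 corresponds to about 10% additional writes, and a value of 1.0 means that garbage collection copied no valid pages.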

Finally, we compared the number of block erases, which
determines the lifespan of the SSD, in Figure 14. As can
be expected from the write amplification analysis, the
numbers of block erases in SFS and LFS-CB are
significantly lower than in all others. Since the
segment cleaning overhead of SFS is lower than that of
LFS-CB, the number of block erases in SFS is the
smallest: LFS-CB incurs 20% more block erases in total
in both FAST and page-level FTL. Erase counts of the
overwrite file systems, ext4 and btrfs-nodatacow, are
significantly higher than that of SFS. In total, ext4
incurs 3.1 times more block erases in FAST and 1.8 times
more block erases in page-level FTL. Similarly, the
total erase counts of btrfs-nodatacow are 3.4 times
higher in FAST and 2.0 times higher in page-level FTL.
Interestingly, btrfs incurs the largest number of block
erases: in total, 6.1 times more block erases in FAST
and 3.8 times more block erases in page-level FTL, and
in the worst case 7.5 times more block erases than SFS.
Although the no-overwrite scheme in btrfs incurs lower
write amplification than ext4 and btrfs-nodatacow, btrfs
shows a large overhead to support copy-on-write and to
manage the fragmentation [21, 46] induced by random
writes at the file level.

In summary, the erase count of the in-place-update
file systems is high because of high write
amplification. That of the no-overwrite file system is
also high, even at relatively low write amplification,
due to the number of write requests from the file
system. The majority of the overhead comes from
supporting no-overwrite and handling fragmentation in
the file system. Fragmentation of a no-overwrite file
system under random writes is a widely known problem
[21, 46]: successive random writes eventually move all
blocks into arbitrary positions, and this makes all I/O
accesses random at the LBA level. Defragmentation, which
is similar to segment cleaning in a log-structured file
system, has been implemented [21, 1] to reduce the
performance problem of fragmentation. Like segment
cleaning, it incurs additional overhead to move blocks.
In the case of log-structured file systems, if we
carefully choose the segment size to be aligned with the
clustered block size, write amplification can be
minimal. In this case, the segment cleaning overhead is
the major overhead that increases the erase count. SFS
is shown to drastically reduce the segment cleaning
overhead. It can also be seen that the write
amplification and erase count of SFS are significantly
lower than those of all other file systems. Therefore,
SFS can significantly increase the lifetime as well as
the performance of SSDs.


5 Related Work 


Flash memory based storage systems and log-structured
techniques have received a lot of interest in both
academia and industry. Here we present only the papers
most related to our work.

FTL-level approaches: There are many FTL-level 
approaches to improve random write performance. 


FAST '12: 10th USENIX Conference on File and Storage Technologies




Among hybrid FTL schemes, FAST [24] and LAST
[22] are representative. FAST [24] enhances random
write performance by improving the log area utilization
with flexible mapping in the log area. LAST [22] further
improves on FAST [24] by separating random log blocks
into hot and cold regions to reduce the full merge cost.
Among page-level FTL schemes, DAC [13] and DFTL
[14] are representative. DAC [13] clusters data blocks
with similar write frequencies into the same logical
group to reduce the garbage collection cost. DFTL [14]
reduces the required RAM size for the page-level mapping
table by using dynamic caching. FTL-level approaches
exhibit a serious limitation in that they depend almost
exclusively on LBA to decide sequentiality, hotness,
clustering, and caching. Such approaches deteriorate
when a file system adopts a no-overwrite block
allocation policy.

Disk-based log-structured file systems: There is
much research on optimizing log-structured file systems
on conventional hard disks. In the hole-plugging method
[44], the valid blocks in victim segments are
overwritten to the holes, i.e., invalid blocks, in other
segments with a few invalid blocks. This reduces the
copying cost of valid blocks in segment cleaning.
However, this method is beneficial only on storage media
that allow in-place updates. Matthews et al. [26]
proposed an adaptive method that combines the
cost-benefit policy and hole-plugging. It first
estimates the costs of the cost-benefit policy and of
hole-plugging, and then adaptively selects the policy
with the lower cost. However, their cost model is based
on the performance characteristics of HDDs: seek and
rotational delay. WOLF [42] separates hot pages and cold
pages into two different segment buffers according to
the update frequency of data pages, and writes two
segments to disk at once. This system works well only
when hot pages and cold pages are roughly half and half,
so that they can be separated into two segments. HyLog
[43] uses a hybrid approach: logging for hot pages to
achieve high write performance and overwriting for cold
pages to reduce the segment cleaning cost. In HyLog, it
is critical to estimate the ratio of hot pages to
determine the update policy. However, similar to the
adaptive method, its cost model is based on the
performance characteristics of HDDs.


Flash-based log-structured file systems: In embedded
systems with limited CPU and main memory, specially
designed file systems that directly access raw flash
devices are commonly used. To handle the unique
characteristics of flash memory, including no in-place
update, wear-leveling, and bad block management, these
systems take the log-structured approach. JFFS2 [45],
YAFFS2 [47], and UBIFS [41] are widely used flash-based
log-structured file systems. In terms of segment
cleaning, each uses a turn-based selection algorithm
[45, 47, 41] that incorporates wear-leveling into the
segment cleaning process. This consists of two phases,
namely X and Y turns. In the X turn, it selects a victim
segment using a greedy policy without considering
wear-leveling. During the Y turn, it probabilistically
selects a fully valid segment as a victim block for
wear-leveling.
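The two turns can be illustrated with a short Python sketch. It is a simplified model built on assumptions, not the code of JFFS2, YAFFS2, or UBIFS: the data layout, the function name, and the fallback to the greedy choice when no fully valid segment exists are all hypothetical, and how X and Y turns are scheduled is left to the caller.

```python
import random


def select_victim(segments, y_turn, rng=random):
    """Pick a victim segment.

    segments maps a segment id to a (live_blocks, total_blocks) pair.
    X turn: greedy choice, the segment with the fewest live blocks.
    Y turn: a randomly chosen fully valid segment, for wear-leveling.
    """
    if y_turn:
        fully_valid = [seg for seg, (live, total) in segments.items()
                       if live == total]
        if fully_valid:
            return rng.choice(fully_valid)
        # Assumed fallback: no fully valid segment exists, so act greedily.
    return min(segments, key=lambda seg: segments[seg][0])


# Hypothetical segment table with 64 blocks per segment.
segments = {"a": (10, 64), "b": (64, 64), "c": (3, 64)}

x_victim = select_victim(segments, y_turn=False)
y_victim = select_victim(segments, y_turn=True)
```

The greedy X turn minimizes the number of live blocks copied, while the occasional Y turn relocates cold, fully valid data so that its blocks also take part in erase cycles.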


6 Conclusion and Future Work 


In this paper, we proposed a next-generation file system
for SSDs, SFS. It takes a log-structured approach that
transforms random writes at the file system into
sequential writes at the SSD, thus achieving high
performance and also prolonging the lifespan of the SSD.
Also, in order to exploit the skewness in I/O workloads,
SFS captures the hotness semantics at the file block
level and utilizes these in grouping data eagerly on
writing. In particular, we devised an iterative segment
quantization algorithm for correct data grouping and
also proposed the cost-hotness policy for victim segment
selection. Our experimental evaluation confirms that SFS
considerably outperforms existing file systems such as
LFS, ext4, and btrfs, and prolongs the lifespan of SSDs
by drastically reducing the block erase count inside the
SSD.

Another interesting question is the applicability of
SFS to HDDs. Though SFS was designed primarily for SSDs,
its key techniques are agnostic to storage devices.
While random writes are more serious in SSDs, since they
hurt the lifespan as well as performance, they also hurt
performance in HDDs due to increased seek time. We did
preliminary experiments to see if SFS is beneficial on
HDDs and got promising experimental results. As future
work, we intend to explore the applicability of SFS to
HDDs in greater depth.


Abstract


Flash-based solid-state drives (SSDs) have the poten- 
tial to eliminate the I/O bottlenecks in data-intensive ap- 
plications. However, the large performance discrepancy 
between Flash reads and writes introduces challenges 
for fair resource usage. Further, existing fair queueing 
and quanta-based I/O schedulers poorly manage the I/O 
anticipation for Flash I/O fairness and efficiency. Some 
also suppress the I/O parallelism which causes substan- 
tial performance degradation on Flash. This paper de- 
velops FIOS, a new Flash I/O scheduler that attains fair- 
ness and high efficiency at the same time. FIOS em- 
ploys a fair I/O timeslice management with mechanisms 
for read preference, parallelism, and fairness-oriented 
I/O anticipation. Evaluation demonstrates that FIOS 
achieves substantially better fairness and efficiency com- 
pared to the Linux CFQ scheduler, the SFQ(D) fair 
queueing scheduler, and the Argon quanta-based sched- 
uler on several Flash-based storage devices (including 
a CompactFlash card in a low-power wimpy node). In 
particular, FIOS reduces the worst-case slowdown by a 
factor of 2.3 or more when the read-only SPECweb work- 
load runs together with the write-intensive TPC-C. 


1 Introduction 


NAND Flash devices [1, 20, 24] are widely used as 
solid-state storage on conventional machines and low- 
power wimpy nodes [2,6]. Compared to mechanical 
disks, they deliver much higher I/O performance which 
can alleviate the I/O bottlenecks in critical data-intensive 
applications. Emerging non-volatile memory (NVRAM) 
technologies such as phase-change memory [10, 12], 
memristor, and STT-MRAM promise even better perfor- 
mance. However, these NVMs under today’s manufac- 
turing technologies still suffer from low space density 
(or high $/GB) and stability/durability problems. Until 
these issues are resolved sometime in the future, NAND 
Flash devices will likely remain the dominant solid-state 
storage in computer systems. 


*This work was supported in part by the National Science Founda-
tion (NSF) grant CCF-0937571, NSF CAREER Award CCF-0448413,
a Google Research Award, and an IBM Faculty Award.


While Flash-based storage devices may offer
substantially improved I/O performance over mechanical
disks, there are critical limitations with respect to
writes. First, Flash suffers from an erase-before-write
limitation. That is, in order to overwrite a previously
written location, the said location must first be erased
before writing the new data. Further aggravating the
problem is that the erasure granularity is typically
much larger (64-256×) than the basic I/O granularity
(2-8 KB). This leads to a large read/write speed
discrepancy: Flash reads can be one or two orders of
magnitude faster than writes. This is very different
from mechanical disks, on which read and write
performance are both dominated by seek/rotation delays
and exhibit similar characteristics.

For a concurrent workload with a mixture of readers 
and synchronous writers running on Flash, readers may 
be blocked by writes with substantial slowdown. This 
means unfair resource utilization between readers and 
writers. In extreme cases, it may present vulnerability 
to denial-of-service attacks—a malicious user may in- 
voke a workload with a continuous stream of writes to 
block readers. At the opposite end, strictly prioritiz- 
ing reads over writes might lead to unfair (and some- 
times extreme) slowdown for applications performing 
synchronous writes. Synchronous writes are essential for 
applications that demand high data consistency and dura- 
bility, including databases, data-intensive network ser- 
vices [28], persistent key-value store [2], and periodic 
state checkpointing [19]. 

With important implications for performance and
reliability, Flash I/O fairness warrants first-class
attention in operating system I/O scheduling.
Conventional scheduling methods to achieve fairness
(like fair queueing [5, 18] and quanta-based scheduling
[3, 36]) fail to recognize unique Flash characteristics
like substantial read-blocked-by-write. In addition, I/O
anticipation (temporarily idling the device in
anticipation of a soon-arriving desirable request) is
sometimes necessary to maintain fair resource
utilization. While I/O anticipation was proposed as a
performance-enhancing seek-reduction technique for
mechanical disks [17], its role in maintaining fairness
has been largely ignored. Finally, quanta-based
scheduling schemes [3, 36] typically suppress the I/O
parallelism between concurrent tasks, which
substantially degrades the I/O efficiency on Flash
devices with internal parallelism.
This paper presents a new operating system I/O
scheduler (called FIOS) that achieves fair Flash I/O
while attaining high efficiency at the same time. Our scheduler
uses timeslice management to achieve fair resource uti- 
lization under high I/O load. We employ read preference 
to minimize read-blocked-by-write in concurrent work- 
loads. We exploit device-level parallelism by issuing 
multiple I/O requests simultaneously when fairness is not 
violated. Finally, we manage I/O anticipation judiciously 
such that we achieve fairness with limited cost of device 
idling. 

We implemented our scheduler in Linux and demon- 
strated our results on multiple Flash devices including 
three solid-state disks and a CompactFlash card in a low- 
power wimpy node. Our evaluation employs several ap- 
plication workloads including the SPECweb workload 
on an Apache web server, TPC-C workload on a MySQL 
database, and the FAWN Data Store developed specifically
for low-power wimpy nodes [2]. Our empirical work also
uncovered a flaw in current Linux: synchronous writes
are managed inconsistently across the file system and
I/O scheduler layers.

The rest of this paper is organized as follows. Sec- 
tion 2 discusses related work. Section 3 characterizes 
key challenges for supporting Flash I/O fairness and ef- 
ficiency that motivate our work. Section 4 presents the 
design of our FIOS scheduler for Flash storage devices. 
Section 5 describes some implementation notes and Sec- 
tion 6 illustrates our experimental evaluation. Section 7 
concludes this paper with a summary of our findings. 


2 Related Work 


There is significant recent research interest in I/O
performance characterization of Flash-based storage
devices. Agrawal et al. [1] discussed the impact of
block erasure (before writes) and parallelism on the
performance of Flash-based SSDs. Polte et al. [31] found
that Flash reads are substantially faster than writes.
Past studies identified abnormal performance issues due
to read/write interference and storage fragmentation
[7], as well as erasure-induced variance of Flash write
latency [9]. There is also a recognition of the
importance of internal parallelism to Flash I/O
efficiency [8, 30], while our past work identified that
the effects of parallelism depend on specific firmware
implementations [30]. Previous Flash I/O
characterization results provide motivation and
foundation for the Flash I/O scheduling work in this
paper.

Recent research has investigated operating system
techniques to manage Flash-based storage. File system
work [11, 23, 25] has attempted to improve the
sequential write patterns through the use of
log-structured file systems. These efforts are
orthogonal to our research on Flash I/O scheduling. New
I/O scheduling heuristics were proposed to improve Flash
I/O performance. In particular, write bundling [21],
write block preferential [14], and page-aligned request
merging/splitting [22] help match I/O requests with the
underlying Flash device data layout. The effectiveness
of these write alignment techniques, however, is limited
on modern SSDs with write-order-based block mapping.
Further, previous Flash I/O schedulers have paid little
attention to the issue of fairness.

Conventional I/O schedulers are largely designed to 
mitigate the high seek and rotational costs in mechan- 
ical disks, through elevator-style I/O request ordering 
and anticipatory I/O [17]. Quality-of-service objectives 
(like meeting task deadlines) were also considered in I/O 
scheduling techniques, including Facade [27], Reddy et 
al. [33], pClock [16], and Fahrrad [32]. Fairness was not 
a primary concern in these techniques and they cannot 
address the fairness problems in Flash storage devices. 

Fairness-oriented resource scheduling has been ex- 
tensively studied in the literature. The original fair 
queueing approaches including Weighted Fair Queue- 
ing (WFQ) [13], Packet-by-Packet Generalized Proces- 
sor Sharing (PGPS) [29], and Start-time Fair Queueing 
(SFQ) [15] take virtual time-controlled request ordering 
over several task queues to maintain fairness. While 
they are designed for network packet scheduling, later 
fair queueing approaches like YFQ [5] and SFQ(D) [18] 
are adapted to support I/O resources. In particular, they 
allow the flexibility to re-order and parallelize I/O re- 
quests for better efficiency. Alternatively, I/O fair queue- 
ing can be achieved using dedicated per-task quanta 
(as in Linux CFQ [3] and Argon [36]) and credits (as 
in the SARC rate controller [37]). Achieving fairness 
and efficiency on Flash storage, however, must address 
unique Flash I/O characteristics like read/write perfor- 
mance asymmetry and internal parallelism. A proper 
management of I/O anticipation for fairness is also nec- 
essary. 


3 Challenges and Motivation 


We characterize key challenges for supporting Flash 
I/O fairness and maintaining high efficiency at the same 
time. They include effects of inherent device charac- 
teristics (read/write asymmetry and internal parallelism) 
as well as behavior of operating system I/O schedulers 
(role of I/O anticipation). These results and analysis 
serve as both background and motivation for our new I/O 
scheduling design. 

Experiments in this section and the rest of the paper
utilize the following Flash-based storage devices:

• An Intel X25-M Flash-based SSD released in 2009.
This drive uses multi-level cells (MLC) in which a
particular cell is capable of storing multiple bits of
information.

• An Mtron Pro 7500 Flash-based SSD, released in
2008, using single-level cells (SLC).

• An OCZ Vertex 3 Flash-based SSD, released in 2011,
using MLC. This drive employs the SandForce
controller, which supports new write acceleration
techniques such as online compression.

• A SanDisk CompactFlash drive on a 6-Watt
“wimpy” node similar to those employed in the
FAWN array [2].

Figure 1: Distribution of 4 KB read response time on four Flash-based storage devices. The first row shows the read
response time when a read runs alone. The second row shows the read performance in the presence of a concurrent
4 KB write. The two figures in each column (for one drive) use the same X-Y scale and they can be directly compared.
Figures across different columns (for different drives) necessarily use different X-Y scales due to differing drive
characteristics. We intentionally do not show the quantitative Y values (probability densities) in the figures because
these values have no inherent meaning and they simply depend on the width of each bin in the distribution histogram.


Read/Write Fairness. Our first challenge to Flash I/O
fairness is that Flash writes are often substantially
slower than reads, and a reader may experience excessive
slowdown in the presence of concurrent writes. We try to
understand this by measuring the read/write
characteristics of the four Flash devices described
above. To acquire the native device properties, we
bypass the memory buffer, operating system I/O
scheduler, and the device write cache in the
measurements. We also use incompressible data in the I/O
measurement to assess the baseline performance for the
Vertex drive (whose SandForce controller performs online
compression).

Our measurement employs 4 KB reads or writes to
random storage locations. Figure 1 illustrates the read
response time distribution in two cases: read alone and
read in the presence of a concurrent write. Comparing
the latter with the read-alone performance (first row),
we find that a Flash read can experience one or two
orders of magnitude slowdown while being blocked by a
concurrent write. Further, the Flash read response time
becomes much less stable (or more unpredictable) when
blocked by a concurrent write. One exception to this
finding is the Vertex drive with the SandForce
controller. Writes on this drive are only modestly
slower than reads, and therefore the
read-blocked-by-write effect is much less pronounced on
this drive than on others.


We further examine the fairness between two tasks— 
a reader that continuously performs 4 KB reads to ran- 
dom locations (issues another one immediately after the 
previous one completes) and a writer that continuously 
performs synchronous 4 KB writes to random locations. 
Figure 2 shows the slowdown ratios for reads and writes 
during a concurrent execution. Results show that the 
write slowdown ratios are close to one on all Flash stor- 
age devices, indicating that the write performance in the 
concurrent execution is similar to the write-alone
performance. However, reads experience 7×, 157×, 2×,
and 42× slowdown on the Intel SSD, Mtron SSD, Vertex
SSD, and the low-power CompactFlash, respectively.


Existing fairness-oriented I/O schedulers [3, 5, 18, 36,
37] do not recognize the Flash read/write performance
asymmetry. Consequently they provide no support to
address the problem of excessive read-blocked-by-write
on Flash.

Figure 2: Slowdown of random 4 KB reads and writes
in a concurrent execution. The I/O slowdown ratio for
read (or write) is the I/O latency normalized to that when
running alone.


Role of I/O Anticipation  I/O anticipation (temporarily
idling the device in anticipation of a soon-arriving desirable
request) was proposed as a performance-enhancing
seek-reduction technique for mechanical disks [17].
However, its performance effects on Flash are largely
negative because the cost of device idling far outweighs
the limited benefit of I/O spatial proximity. Due to the lack
of performance gain on Flash, the Linux CFQ scheduler
disables I/O anticipation for non-rotating storage devices
like Flash. Fair queueing approaches like YFQ [5] and
SFQ(D) [18] also provide no support for I/O anticipation.

However, I/O anticipation is sometimes necessary to 
maintain fair resource utilization. Without anticipation, 
unfairness may arise due to prematurely switching
task queues before the allotted I/O quantum is fully uti- 
lized (in quanta-based scheduling) or the premature ad- 
vance of virtual time for “inactive tasks” (in fair queueing 
schedulers). Consider the simple example of a concur- 
rent run involving a reader and a writer. After servicing 
a read, the only queued request at the moment is a write 
and therefore a work-conserving I/O scheduler will issue 
it. This breaks up the allotted quantum for the reader. 
Even if the reader issues another read after a short think- 
time, it would be blocked by the outstanding write. 

At the opposite end, the quanta-based scheduling in 
Argon [36] employs aggressive I/O anticipation such that 
it is willing to wait through a task queue’s full quantum
even if few requests are issued. Such excessive I/O antic- 
ipation can lead to long idle time and drastically reduce 
performance on Flash storage if useful work could other- 
wise have been accomplished. Particularly for fast Flash 
storage, a few milliseconds are often sufficient for com- 
pleting a significant amount of work. 

Figure 3: Fairness of different I/O anticipation approaches
for concurrent reader/writer on the Intel SSD.


We run a simple experiment to demonstrate the fairness
and efficiency effects of improper I/O anticipation
on Flash. We run a reader and a writer concurrently on
the Intel SSD. Each task induces some thinktime between
I/O such that the thinktime is approximately
equal to its I/O device usage time. Figure 3 shows the
reader/writer slowdown under three I/O scheduling
approaches. Implementation details of the schedulers are
provided later in Section 5. The Linux CFQ and SFQ(D)
do not support I/O anticipation, which leads to poor
fairness between the reader and writer. The full-quantum
anticipation exhibits better fairness (similar reader/writer
slowdown) but this is achieved at excessive slowdown for
both reader and writer. Such fairness is not worthwhile.

While our discussion above uses the example of a 
reader running concurrently with a writer, the fairness 
implication of I/O anticipation generally applies to con- 
current tasks with requests of differing resource usage. 
For instance, similar fairness problems with no I/O an- 
ticipation or over-aggressive anticipation can arise when 
a task making 4 KB reads runs concurrently with a task 
making 128 KB reads. 


Parallelism vs. Fairness Flash-based SSDs have some 
built-in parallelism through the use of multiple channels. 
Within each channel, each Flash package may have mul- 
tiple planes which are also parallel. Figure 4 shows the 
efficiency of Flash I/O parallelism for 4KB reads and 
writes on our Intel, Mtron, and Vertex SSDs. We observe 
that the parallel issuance of multiple reads to an SSD may 
lead to throughput enhancement. The speedup is mod- 
est (about 30%) for the Mtron SLC drive but substantial 
(up to 7-fold and 4-fold) for the Intel and Vertex MLC 
drives. On the other hand, writes do not seem to benefit 
from I/O parallelism on the Intel and Mtron drives while 
write parallelism on the Vertex drive can have up to 3-fold
speedup. We also experimented with parallel I/O at
larger (>4 KB) sizes and we found that the speedup of
parallel request issuance is less substantial for large I/O
requests. A possible explanation is that a single large I/O
request may already benefit from the internal device
parallelism and therefore parallel request issuance will see
less additional efficiency gain.

Figure 4: Efficiency of I/O parallelism for 4 KB reads and writes on three Flash-based SSDs.

The internal parallelism on Flash-based SSDs has sig- 
nificant implication on fairness-oriented I/O schedul- 
ing. In particular, the quanta-based schedulers (like 
Linux CFQ [3] and Argon [36]) only issue I/O requests 
from one task queue at a time, which limits parallelism. 
The rationale is probably to ease the accounting and allo- 
cation of device time usage for each queue. However, the 
suppression of I/O parallelism in these schedulers may 
lead to substantial performance degradation on Flash. 
A desired Flash I/O scheduler must exploit device-level 
parallelism by issuing multiple I/O requests simultane- 
ously while ensuring fairness at the same time. 


4 FIOS Design 


In a multiprocessing system, many resource principals 
simultaneously compete for the shared I/O resource. The 
scheduler should regulate I/O in such a way that accesses 
are fair. When the storage device time is the bottleneck 
resource in the system, fairness means that each resource
principal acquires an equal amount of device time.
When the storage device is partially loaded, the critical 
problem is that a read blocked by a write experiences far 
worse slowdown than a write blocked by a read. Such 
worst-case slowdown should be minimized. 

Practical systems may desire fairness for different 
kinds of resource principals. For example, a general- 
purpose operating system may desire fairness support 
among concurrent processes. A server system may need 
fairness across simultaneously running requests [4, 34]. 
A shared hosting platform may want fairness across mul- 
tiple virtual machines [26]. Our design of fair Flash I/O 
scheduling and much of our implementation can be generally
applied to supporting arbitrary resource principals.
When describing the FIOS design, we use the term task 
to represent the resource principal that receives the fair- 
ness support in a concurrent execution. 

Our I/O scheduler, FIOS, tries to achieve fairness 
while attaining high efficiency at the same time. Based 
on our evaluation and analysis in Section 3, our scheduler 
contains four techniques. We first provide a fair times- 
lice management that allows timeslice fragmentation and 
concurrent request issuance (Section 4.1). We then sup- 
port read preference to minimize the read-blocked-by- 
write situations (Section 4.2). We further enable concur- 
rent issuance of requests to maximize the efficiency of 
device-level parallelism (Section 4.3). Finally, we devise 
limited I/O anticipation to maintain fairness at minimal 
device idling cost (Section 4.4). 


4.1 Fair Timeslice Management 


FIOS builds around a fairness mechanism of equal 
timeslices which govern the amount of time a task has 
access to the storage device. As each task is given equal 
time-based access to the storage device, the disparity be- 
tween read and write access latency of Flash cannot lead 
to unequal device usage between tasks. In addition, us- 
ing timeslices provides an upper bound on how long a 
task may have access to the storage device, ensuring that 
no task will be starved indefinitely. 

Our I/O timeslices are reminiscent of the I/O quanta in 
quanta-based fairness schedulers like Linux CFQ [3] and 
Argon [36]. However, the previous quanta-based sched- 
ulers suffer two important limitations that make them un- 
suitable for Flash fairness and efficiency. 


• First, their I/O quanta do not allow fragmentation:
a task must use its current quantum continuously
or it will have to wait for its next quantum in the
round-robin order. The rationale (on mechanical disk
storage devices) was that a long continuous run by a
single task tends to require less disk seek and rotation
[36]. But for a task that performs I/O with
substantial inter-I/O thinktime, this design leaves two
undesirable choices: either its quantum ends prematurely
so the remaining allotted resource is forfeited
(as in Linux CFQ) or the device idles through a task’s
full quantum even if few requests are issued (as in
Argon).

• Second, the previous quanta-based schedulers only
allow I/O requests from one task to be serviced at a
time. This was a reasonable design decision for individual
mechanical disks that do not possess internal
parallelism. It also has the advantage of easy resource
accounting for each task. However, this mechanism
suppresses Flash I/O parallelism and consequently
hurts I/O efficiency.

FAST ’12: 10th USENIX Conference on File and Storage Technologies


To address these problems, FIOS allows I/O times- 
lice fragmentation and concurrent request issuance from 
multiple tasks. Specifically, we manage timeslices in an 
epoch-based fashion. An epoch is defined by a collec- 
tion of equal timeslices, one per task; the I/O scheduler 
should achieve fairness in each epoch. After an I/O com- 
pletion, the task’s remaining timeslice is decremented by 
an appropriate I/O cost. The cost is the elapsed time from 
the I/O issuance to its completion when the storage device
is dedicated to this request for that duration. The cost
accounting is more complicated in the presence of paral- 
lel I/O from multiple tasks, which will be elaborated in 
Section 4.3. A currently active task does not forfeit its 
remaining timeslice should another task be selected for 
service by the scheduler. In other words, the timeslice 
of a task can be consumed over several non-contiguous 
periods within an epoch. Once a task has consumed its 
entire timeslice, it must wait until the next epoch at which 
point its timeslice is refreshed. 

The current epoch ends and a new epoch begins when 
either 1) there is no task with non-zero remaining times- 
lice in the current epoch; or 2) all tasks with non-zero 
remaining timeslices make no I/O request. Fairness must 
be maintained in the case of deceptive idleness [17]. 
Specifically, the I/O scheduler may observe a short idle 
period from a task between two consecutive I/O requests 
it makes. A fair-timeslice epoch should not end at such 
deceptive idleness if the task has non-zero remaining 
timeslice. This is addressed through fairness-oriented 
I/O anticipation elaborated in Section 4.4. 
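
The epoch rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FIOS kernel code: the names `Task`, `EpochScheduler`, and `TIMESLICE_MS` are hypothetical, outstanding (already issued) requests are not modeled, and anticipation is left out, so a deceptively idle task would end the epoch here.

```python
from dataclasses import dataclass, field

TIMESLICE_MS = 100.0  # per-task timeslice; hypothetical constant for this sketch


@dataclass
class Task:
    name: str
    remaining_ms: float = TIMESLICE_MS
    queued: int = 0  # requests waiting in the scheduler (assumed bookkeeping)


@dataclass
class EpochScheduler:
    tasks: dict = field(default_factory=dict)
    epoch: int = 0

    def add_task(self, name):
        self.tasks[name] = Task(name)

    def charge(self, name, cost_ms):
        """Deduct the cost of a completed I/O from the owner's timeslice."""
        task = self.tasks[name]
        task.remaining_ms = max(0.0, task.remaining_ms - cost_ms)

    def eligible(self):
        """Tasks that still hold timeslice and have a request queued."""
        return [t for t in self.tasks.values()
                if t.remaining_ms > 0 and t.queued > 0]

    def maybe_end_epoch(self):
        """Begin a new epoch when no task with remaining timeslice has I/O.

        Covers both end conditions: (1) no task has timeslice left, or
        (2) every task that still has timeslice makes no request.
        """
        if self.eligible():
            return False
        for t in self.tasks.values():
            t.remaining_ms = TIMESLICE_MS  # refresh every timeslice
        self.epoch += 1
        return True
```

Because `charge` only decrements a counter, a task can spend its timeslice over several non-contiguous periods, which is the fragmentation property described above.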


4.2 Read/Write Interference Management 


Our preliminary evaluation in Section 3 shows strong 
interference between concurrent reads and writes on 
some of the Flash drives, an effect also observed by oth- 
ers [7]. Considering that reads are faster than writes, 
reads suffer more dramatically from such interference 
while the impact on writes appears marginal. A concurrent
write not only slows down reads, it also disrupts
device-level read parallelism which leads to further effi- 
ciency loss. Part of our fairness goal is to minimize the 
worst-case task slowdown. For such fairness, we adopt a 
policy of read preference combined with write blocking 
to reduce the read-interfered-by-write occurrences. Such 
a policy gives preference to shorter jobs, which tends to 
produce faster mean response time than a scheduler that 
is indiscriminate of job service time. This is a side bene- 
fit beyond minimizing the worst-case slowdown. 


When both read and write requests are queued in the 
I/O scheduler, our policy of read preference will allow 
read requests to be issued first. To further avoid inter- 
ference from later-issued writes, we block all write re- 
quests until outstanding reads are completed. Under this 
approach, a read is only blocked by a write when the 
read arrives at the I/O scheduler after the write has al- 
ready been issued. This is due to the non-preemptibility 
of I/O. Both read preference and write blocking lead to 
additional queuing time for writes. Fortunately, because 
reads are serviced quickly, the additional queueing time 
the write request experiences is typically small compared 
to the write service time. Note that the read preference 
mechanism is still governed by the epoch-based times- 
lice enforcement, which serves as an ultimate preventer 
of write starvation. 
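
A minimal sketch of the dispatch rule described above, assuming a toy in-memory queue. The class `ReadPreferenceQueue` and its method names are hypothetical, and the epoch-based timeslice check that ultimately prevents write starvation is deliberately omitted.

```python
from collections import deque


class ReadPreferenceQueue:
    """Toy dispatcher: reads go first; writes wait for outstanding reads."""

    def __init__(self):
        self.reads = deque()
        self.writes = deque()
        self.outstanding_reads = 0  # reads issued to the device, not yet done

    def enqueue(self, kind, req_id):
        (self.reads if kind == "read" else self.writes).append(req_id)

    def next_request(self):
        """Return (kind, req_id) to issue now, or None if nothing may issue."""
        if self.reads:  # read preference: queued reads are issued first
            self.outstanding_reads += 1
            return ("read", self.reads.popleft())
        if self.writes and self.outstanding_reads == 0:
            # write blocking: a write is issued only once no read is in flight
            return ("write", self.writes.popleft())
        return None

    def complete_read(self):
        self.outstanding_reads -= 1
```

Once a write has been issued it cannot be preempted, so a read arriving afterwards still waits behind it, as the text notes.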


Our preliminary evaluation in Section 3 also shows 
that while the read/write interference is very strong on 
some drives, it is quite modest on the Vertex SSD. On 
such a drive, the benefit of read preference and write 
blocking is modest and it may be outweighed by its draw- 
backs of possible write starvation and suppressing the 
mixed read/write parallelism. Therefore the read/write 
interference management is an optional feature in FIOS 
that can be disabled for drives that do not exhibit strong 
read/write interference. 


4.3 I/O Parallelism 


Many Flash-based solid-state drives contain internal 
parallelism that allows multiple I/O requests to be ser- 
viced at the same time. To achieve high efficiency and 
exploit the parallel architecture in Flash, multiple I/O 
requests should be issued to the Flash device in paral- 
lel when fairness is not violated. After issuing an I/O 
request to the storage device, FIOS searches for ad- 
ditional requests which may be queued, possibly from 
other tasks. Any I/O requests that are found are issued 
as long as the owner tasks have enough remaining times- 
lices and the read/write interference management (if en- 
abled) is observed. 


I/O parallelism allows multiple tasks to access the storage
device concurrently, which complicates the accounting
of I/O cost. In particular, a task should not be billed
by the full elapsed time from its request issuance to com- 
pletion if requests from other tasks are simultaneously 
outstanding on the storage device. The ideal cost ac- 
counting for an I/O request should exclude the request 
queueing time at the device during which it waits for 
other requests and it does not consume the bottleneck re- 
source. A precise accounting, however, is difficult with- 
out the device-level knowledge of resource sharing be- 
tween multiple outstanding requests. 

We support two approaches for I/O cost accounting 
under parallelism. In the first approach, we calibrate the 
elapsed time of standalone read/write requests at differ- 
ent data sizes and use the calibration results to assign the 
cost of an I/O request online depending on its type (read 
or write) and size. Our implementation further assumes a 
linear model (typically with a substantial nonzero offset) 
between the cost and data size of an I/O request. There- 
fore we only need to calibrate four cases (read 4 KB, read 
128 KB, write 4 KB, and write 128 KB) and use the lin- 
ear model to estimate read/write costs at other data sizes. 
In practice, such calibration is performed once for each 
device, possibly at the device installation time. Note that 
the need of request cost estimation is not unique to our 
scheduler. Start-time Fair Queueing schedulers [15, 18] 
also require a cost estimation for each request when it 
just arrives (for setting its start and finish tags). 
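
The calibrated linear model can be illustrated as follows. This is a sketch only: the function names are hypothetical and the calibration numbers are invented for the example, not measurements from any of the drives.

```python
def fit_linear_cost(small, large):
    """Fit cost = offset + slope * size through two (size_kb, cost_ms) points."""
    (size_a, cost_a), (size_b, cost_b) = small, large
    slope = (cost_b - cost_a) / (size_b - size_a)
    offset = cost_a - slope * size_a
    return offset, slope


def estimate_cost(model, size_kb):
    """Estimate the cost of a request of size_kb from a fitted model."""
    offset, slope = model
    return offset + slope * size_kb


# Hypothetical calibration results in milliseconds (not measured values):
# one (4 KB, 128 KB) pair per request type, i.e. four calibrated cases.
CALIBRATION_MS = {
    "read": ((4, 0.20), (128, 0.82)),
    "write": ((4, 0.90), (128, 2.14)),
}
MODELS = {op: fit_linear_cost(*points) for op, points in CALIBRATION_MS.items()}
```

With these made-up numbers, `estimate_cost(MODELS["read"], 64)` interpolates a 64 KB read between the two calibrated points; the nonzero offset captures the fixed per-request overhead.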

When the calibrated I/O costs are not available, our 
scheduler employs a backup approach for I/O cost ac- 
counting. Here we make the following assumption about 
the sharing of cost for parallel I/O. During a time period 
when the set of outstanding I/O requests on the storage 
device remains unchanged (no issuance of a new request 
or completion of an outstanding request), all outstanding 
I/O requests equally share the device usage cost in this 
time period. This is probabilistically true when the inter- 
nal device scheduling and operation is independent of the 
task owning the request. Such an assumption allows us 
to account for the cost of parallel I/O with only informa- 
tion available to the operating system. Since the device 
parallelism may change during a request’s execution, an 
accurate accounting of a request’s execution parallelism 
would require carefully tracking the device parallelism 
throughout its execution duration. For simplicity, we use 
the device parallelism at the time of request issuance to 
represent the request execution parallelism. Specifically, 
the I/O cost is calculated as 


$$\mathrm{Cost} = \frac{T_{\mathrm{elapsed}}}{P_{\mathrm{issuance}}} \qquad (1)$$

where $T_{\mathrm{elapsed}}$ is the request’s elapsed time from its
issuance to its completion, and $P_{\mathrm{issuance}}$ is the number of
outstanding requests (including the new request) at the
issuance time.
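
As a small worked illustration of this backup accounting, the sketch below records the number of outstanding requests when a request is issued and divides its elapsed time by that number on completion. The class name `ParallelCostTracker` and the use of caller-supplied timestamps are assumptions made for the example.

```python
class ParallelCostTracker:
    """Charge each request its elapsed time divided by issuance parallelism."""

    def __init__(self):
        # req_id -> (issue time, outstanding requests at issuance)
        self.outstanding = {}

    def issue(self, req_id, now):
        parallelism = len(self.outstanding) + 1  # includes the new request
        self.outstanding[req_id] = (now, parallelism)

    def complete(self, req_id, now):
        """Return the cost to deduct from the owner task's timeslice."""
        issued_at, parallelism = self.outstanding.pop(req_id)
        return (now - issued_at) / parallelism
```

For example, a request issued alone at time 0 and completed at time 4 is charged 4 units, while a second request issued at time 1 (with the first still outstanding) and completed at time 5 is charged 4 / 2 = 2 units.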


4.4 I/O Anticipation for Fairness 


Between two consecutive I/O requests made by a task, 
the I/O scheduler may observe a short idle period. This 
idle period is unavoidable because it takes non-zero time 
for the task to wake up and issue another request. Such 
an idleness is deceptive for tasks that continuously make 
synchronous I/O requests. The deceptive idleness can
be addressed by I/O anticipation [17], which idles the 
storage device in anticipation of a soon-arriving new I/O 
request. On mechanical disks, I/O anticipation can sub- 
stantially improve the I/O efficiency by reducing the seek 
and rotation overhead. In contrast, I/O spatial proxim- 
ity has much less benefit for Flash storage. Therefore 
I/O anticipation has a negative performance effect and 
it must be used judiciously for the purpose of maintain- 
ing fairness. Below we describe two important decisions 
about fairness-oriented I/O anticipation on Flash—When 
to anticipate? How long to anticipate? 


When to anticipate? Anticipation is always consid- 
ered when a request is just completed. We call the task 
that owns the just completed request the anticipating 
task. 

Deceptive idleness may break fair timeslice manage- 
ment when it prematurely triggers an epoch switch while 
the anticipating task will quickly process the just com- 
pleted I/O request and issue another one soon. I/O an- 
ticipation should be utilized to remedy such a fairness 
violation. Specifically, while an epoch would normally 
end if there is no outstanding I/O request from a task 
with non-zero remaining timeslice, we initiate an antici- 
pation before the epoch switch if the anticipating task has 
non-zero remaining timeslice. In this case the anticipa- 
tion target can be either a read or write, though it is more 
commonly write since writers are more likely delayed to 
the end of an epoch under read preference. 

Deceptive idleness may also break read preference. 
When there are few tasks issuing reads, there may be 
instances when no read request is queued. In order to 
facilitate read preference, I/O anticipation is necessary 
after completing a read request. If a read request has just 
been completed, we anticipate for another read request to 
arrive shortly. We do so rather than immediately issuing 
a write to the device that may block later reads. 


How long to anticipate? I/O anticipation duration
must be bounded in case the anticipated I/O request 
never arrives. For maximum applicability and robust- 
ness, the system should not assume any application hints 
or predictor of the application inter-I/O thinktime. For 
seek-reduction on mechanical disks, the I/O anticipation 
bound is set to roughly the time of a disk I/O operation 
which leads to competitive performance compared to the 
optimal offline I/O anticipation. In practice, this is often
set to 6 or 8 milliseconds. Our I/O anticipation bound 
must be different for two reasons. First, the original an- 
ticipation bound addresses the device idling’s tradeoff 
with performance gain of seek reduction. Anticipation 
has a negative performance effect on Flash and we in- 
stead target the different tradeoff with maintaining fair- 
ness. Second, the Flash I/O service time is much smaller 
than that of a disk I/O operation. This exacerbates the 
cost of anticipation-induced device idling on Flash. 

FIOS sets the I/O anticipation bound according to
a configurable threshold of tolerable performance loss
for maintaining fairness. This threshold, $\alpha$, indicates
the maximum proportion of time FIOS idles the device
(while there is pending work) to anticipate for fairness.
Specifically, when the deceptive idleness is about
to break fairness, we anticipate for an idling time bound
of $T_{\mathrm{service}} \cdot \frac{\alpha}{1-\alpha}$, where $T_{\mathrm{service}}$ is the average service time
of an I/O request for the anticipating task. This ensures
that the maximum device idle time is no more than an $\alpha$
proportion of the total device time in a sequence of


I/O → anticipation → I/O → anticipation → ···


In our implementation, FIOS maintains the per-request
I/O service time $T_{\mathrm{service}}$ for each task using an
exponentially-weighted moving average of past request
statistics. FIOS sets $\alpha = 0.5$ by default.

Anticipation-induced device idling consumes device 
time and its cost must be properly accounted and at- 
tributed. We charge the anticipation cost to the timeslice 
of the anticipating task. 
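
The anticipation bound can be sketched as below. If idling is to take at most a fraction α of the time in an alternating sequence of I/O and anticipation, then idle / (idle + service) ≤ α, which gives idle ≤ service · α / (1 − α). The names `ServiceTimeEstimator` and `anticipation_bound`, and the smoothing weight of 0.25, are assumptions for illustration; the text does not state the weight FIOS uses.

```python
class ServiceTimeEstimator:
    """Exponentially-weighted moving average of per-request service time."""

    def __init__(self, weight=0.25):  # hypothetical smoothing weight
        self.weight = weight
        self.avg = None

    def update(self, sample):
        if self.avg is None:
            self.avg = sample  # first observation seeds the average
        else:
            self.avg = (1.0 - self.weight) * self.avg + self.weight * sample
        return self.avg


def anticipation_bound(t_service, alpha=0.5):
    """Longest idle wait that keeps idling within an alpha share of device time."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return t_service * alpha / (1.0 - alpha)
```

With the default α = 0.5 the bound equals the average service time itself, so the scheduler would wait at most about as long as one typical request of the anticipating task.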


5 Implementation Notes 


We implemented our FIOS scheduler with the tech- 
niques of fair timeslice management, read preference, 
I/O parallelism, and I/O anticipation for fairness on 
Linux 2.6.33.4. As part of a general-purpose operat- 
ing system, our prototype provides fairness to concur- 
rent processes. This implementation can be easily ex- 
tended to support request-level fairness in a server sys- 
tem [4,34] or virtual machine fairness in a shared hosting 
platform [26]. 

Our I/O anticipation may sometimes desire a very 
short timer (a few hundred microseconds). The de- 
fault Linux I/O schedulers use the kernel tick-based 
timer. Specifically with 1000 Hz kernel ticks, the min- 
imum timer is 1 millisecond. Further, because the ker- 
nel ticks are not synchronized with the timer setup, the 
next tick may occur right after the timer is set. This 
means that setting the timer to fire at the next tick may 
sometimes lead to almost no anticipation. Our recent re- 
search [35] showed that this already happened to some 
production versions of Linux with coarse-grained tick
timers. Our FIOS implementation instead uses the Linux 
high-resolution timer that can be supported by the pro- 
cessor hardware counter overflow interrupts. This allows 
us to set precise, fine-grained anticipation timers. 

For comparison purposes, we implemented two alter- 
native fairness-oriented I/O schedulers in our experimen- 
tal platform. The first alternative is SFQ(D) [18], which 
is based on the Start-time Fair Queueing approach [15] 
but also allows concurrent request issuance for I/O ef- 
ficiency. The concurrency is controlled by a depth pa- 
rameter D. We set the depth to 32 which allows suffi- 
cient I/O parallelism in all our experiments. The SFQ(D) 
scheduler requires a cost estimation for each request 
when it just arrives (for setting its start and finish tags 
in SFQ(D)). In our implementation, we estimate a read’s 
cost as the average read service time on the device; simi- 
larly, we estimate the cost of a write as the average write 
service time on the device. 

The second alternative is a quanta-based I/O sched- 
uler like the one employed in Argon [36]. This approach 
puts a high priority on achieving fair resource use (even 
if some tasks only have partial I/O load). All tasks take 
round robin turns of I/O quanta. Each task has exclusive 
access to the storage device within its quantum. Once an 
I/O quantum begins, it will last to its end, regardless of 
how few requests are issued by the corresponding task. 
However, a quantum will not begin, if no request from 
the corresponding task is pending. 

The Linux CFQ, our FIOS scheduler, and the quanta 
scheduler all use the concept of per-task timeslice or 
quantum. In the Linux CFQ, the default timeslice is 
100 milliseconds, with minor adjustment according to 
task priorities. Our FIOS and quanta scheduler imple- 
mentations follow the same setting of per-task times- 
lice/quantum. 

During our empirical work, we discovered a flaw in 
Linux: it inconsistently manages synchronous writes
across the file system and I/O scheduler layers. Specif- 
ically, a synchronous operation at the file system level 
(such as a write on an O_SYNC-opened file and I/O as part 
of a fsync() call) is not necessarily considered to be 
synchronous at the I/O scheduler. Note that this incon- 
sistency does not lead to wrong synchronous I/O seman- 
tics to the application since the file system will force a 
wait on the I/O completion before returning to the ap- 
plication. However, being treated as asynchronous I/O 
at the I/O scheduler means that they are scheduled with 
lowest priority, leading to excessive delay for applications
that perform synchronous I/O. We fixed this problem
by patching the mpage_writepage() functions in the
Linux kernel so that file system-level synchronous op- 
erations are properly considered synchronous I/O at the 
scheduler. 
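
For readers unfamiliar with the two file-system-level synchronous operations mentioned above, the following sketch shows them from user space on a POSIX system. The helper names are hypothetical; the code only illustrates what the application requests and says nothing about how the kernel's I/O scheduler classifies the resulting writes.

```python
import os
import tempfile


def write_with_o_sync(path, data):
    """Write through a descriptor opened with O_SYNC (each write is synchronous)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)


def write_then_fsync(path, data):
    """Write normally, then force the data to storage with fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        written = os.write(fd, data)
        os.fsync(fd)  # returns only after the I/O has completed
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        block = b"x" * 4096  # one 4 KB write, as in the synthetic benchmarks
        print(write_with_o_sync(os.path.join(tmp, "a.dat"), block))
        print(write_then_fsync(os.path.join(tmp, "b.dat"), block))
```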

We perform experiments on the ext4 file system. The
ext4 file system uses very fine-grained file timestamps (in
nanoseconds) so that each file write always leads to a new
modification time and thus triggers an additional metadata
write. This is unnecessarily burdensome to many
write-intensive applications. We revert to file timestamps
at the granularity of seconds (which is the default
in Linux file systems that do not make customized
settings). In this case, at most one timestamp metadata write
per second is needed regardless of how often the file is
modified.

We also found that the file system journaling writes 
made the evaluation results less stable and harder to in- 
terpret. Therefore we disabled the journaling in our ex- 
periments. We do not believe this setup choice affects 
the fundamental results of our evaluation. 


6 Experimental Evaluation 


We compare FIOS’s fairness and efficiency against 
three alternative fairness-oriented I/O schedulers:
Linux CFQ scheduler [3], SFQ(D) start-time fair queue- 
ing with a concurrency depth [18], and a quanta-based 
I/O scheduler similar to the one employed in Argon [36]. 
Implementation details for some of these schedulers 
were provided in the previous section. We also compare 
against the raw device I/O in which requests are issued 
to the storage devices as soon as they are passed from the 
file system. 

We explain our fairness and efficiency metrics in eval- 
uation. Fairness is defined as the case that each task gains 
equal access to resources. In a concurrent execution with 
n tasks, this can be observed if each task experiences a 
factor of n slowdown compared to running-alone. We 
call this proportional slowdown. Note that better perfor- 
mance may be achieved when some tasks only contain 
partial I/O load (i.e., they do not make I/O requests for 
significant parts of their execution). Some tasks may also 
gain better performance if they are able to utilize the al- 
lotted resources more efficiently (e.g., through exploiting 
device internal parallelism). However, fairness dictates 
that none should exhibit substantially worse performance 
than the proportional slowdown. 
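
The proportional-slowdown criterion can be expressed as a short check. This is a sketch under assumptions: the function names are hypothetical, and the 10% tolerance used to interpret "substantially worse" is an arbitrary choice for the example, not a value from the evaluation.

```python
def slowdown_ratios(alone_latency, concurrent_latency):
    """Per-task I/O latency in the concurrent run, normalized to running alone."""
    return {task: concurrent_latency[task] / alone_latency[task]
            for task in alone_latency}


def unfairly_slowed(ratios, tolerance=0.1):
    """Tasks slowed substantially beyond the proportional slowdown of n."""
    n = len(ratios)  # proportional slowdown equals the number of tasks
    limit = n * (1.0 + tolerance)  # tolerance is a hypothetical threshold
    return sorted(task for task, ratio in ratios.items() if ratio > limit)
```

For a two-task run in which a reader's latency grows roughly sevenfold while the writer's barely changes, `unfairly_slowed` reports only the reader.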

We also devise a metric to represent the overall system
efficiency of a concurrent execution. This metric, which we
call concurrent efficiency, measures the relative throughput
of the concurrent execution to the running-alone throughput
of individual tasks. Intuitively, it assigns a base
efficiency of 1.0 to each task’s running-alone performance
(in the absence of resource competition and interference)
and then weighs the throughput of a concurrent execution
against the base efficiency. Consider $n$ concurrent
tasks $t_1, t_2, \cdots, t_n$. Let $t_i$’s running-alone throughput
be $\mathrm{Thrput}_i^{\mathrm{alone}}$. Let $t_i$’s throughput in the concurrent
execution be $\mathrm{Thrput}_i^{\mathrm{conc}}$. Then formally for the concurrent
execution:

$$\text{Concurrent efficiency} = \sum_{i=1}^{n} \frac{\mathrm{Thrput}_i^{\mathrm{conc}}}{\mathrm{Thrput}_i^{\mathrm{alone}}} \qquad (2)$$

An efficiency of less than 1.0 indicates the overhead of 
concurrent execution or the lack of full utilization of re- 
sources. An efficiency of greater than 1.0 indicates the 
additional benefit of concurrent execution, e.g., due to 
exploiting the parallelism in the storage device. 
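
A direct transcription of the concurrent efficiency metric, with made-up throughput numbers used purely as an example. The function name and the dictionary-based interface are assumptions.

```python
def concurrent_efficiency(alone_throughput, concurrent_throughput):
    """Sum over tasks of concurrent throughput over running-alone throughput."""
    return sum(concurrent_throughput[task] / alone_throughput[task]
               for task in alone_throughput)


# Hypothetical throughputs in requests per second (not measured values).
alone = {"reader": 100.0, "writer": 50.0}
halved = {"reader": 50.0, "writer": 25.0}    # each task gets half its alone rate
overlap = {"reader": 80.0, "writer": 40.0}   # device parallelism helps both tasks
```

Here `concurrent_efficiency(alone, halved)` is 1.0, the base efficiency, while `concurrent_efficiency(alone, overlap)` is about 1.6, reflecting a gain from concurrency.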

Our experiments utilize the Flash-based storage de- 
vices described in the beginning of Section 3. They 
include three (Intel/Mtron/Vertex) Flash-based SSDs as 
well as a low-power SanDisk CompactFlash drive. 

Section 6.1 will first evaluate the fairness and effi- 
ciency using a set of synthetic benchmarks with varying 
I/O concurrency. Section 6.2 then provides evaluation 
with realistic applications of the SPECweb workload on 
an Apache web server and the TPC-C workload on a 
MySQL database. Finally, Section 6.3 performs evalu- 
ation on a CompactFlash drive in a low-power wimpy 
node using the FAWN Data Store workload [2]. 


6.1 Evaluation with Synthetic I/O Benchmarks 


Synthetic I/O benchmarks allow us to flexibly vary 
parameters in the resource competition. Each synthetic 
benchmark contains a number of tasks issuing I/O re- 
quests of different types and sizes. Evaluation here con- 
siders four benchmark cases: 


• 1-reader 1-writer that concurrently runs a reader
continuously issuing 4 KB reads and a writer continuously
issuing 4 KB writes;

• 4-reader 4-writer that concurrently runs four 4 KB
readers and four 4 KB writers;

• 4-reader 4-writer (with thinktime) that is like the
above case but each task also induces some exponentially
distributed thinktime between I/O such that the
total thinktime is approximately equal to its I/O
device usage time;

• 4 KB-reader and 128 KB-reader that concurrently
runs a reader continuously issuing 4 KB reads and
another reader continuously issuing 128 KB reads.


The last case helps evaluate the value of FIOS for read- 
only workloads or workloads in which writes are asyn- 
chronous and delayed to the background. 


Fairness Figure 5 illustrates the fairness and perfor- 
mance of the three read/write benchmark cases under 
different I/O schedulers. On the two drives (Intel/Mtron 
SSDs) with strong read/write interference, the raw device 
I/O, Linux CFQ, and SFQ(D) fail to achieve fairness. 

Figure 5: Fairness and performance of synthetic read/write benchmarks under different I/O schedulers. The I/O 
slowdown ratio for read (or write) is the I/O latency normalized to that when running alone. Results cover three Flash- 
based SSDs (corresponding to the three columns) and three workload scenarios with varying reader/writer concurrency 
(corresponding to the three rows). For each case, we mark the slowdown ratio that is proportional to the total number 
of tasks in the system, which is a measure of fairness. 


Specifically, readers experience many times the propor- 
tional slowdown while writers are virtually unaffected. 
Because raw device I/O makes no attempt to schedule 
I/O, reads and writes are interleaved as they are issued 
by applications, severely affecting the response of read 
requests. The Linux CFQ does not perform much better 
because it disables I/O anticipation for non-rotating stor- 
age devices like Flash and it suppresses I/O parallelism 
between concurrent tasks. SFQ(D) also suffers from 
poor fairness due to its lack of I/O anticipation. For instance, 
without anticipation, two-task executions degenerate to 
one-read/one-write interleaved I/O issuance and 
poor fairness. The quanta scheduler achieves better fair- 
ness than other alternatives due to its aggressive main- 
tenance of per-task quantum. However, it suffers from 
the cost of excessive I/O anticipation and suppression of 
I/O parallelism. In contrast, FIOS maintains fairness (ap- 
proximately at or below proportional slowdown) in all 
the evaluation cases due to our proposed techniques. 
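
The fairness criterion used in this comparison can be written down as a short sketch. The slowdown ratio follows the definition in the Figure 5 caption; the 10% tolerance that stands in for "approximately at or below proportional slowdown" and all latency values are hypothetical and chosen only for illustration.

```python
def slowdown_ratio(concurrent_latency, alone_latency):
    """I/O latency in the concurrent run normalized to the running-alone latency."""
    return concurrent_latency / alone_latency

def is_fair(ratios, tolerance=0.1):
    """Treat a run as fair when no task exceeds the proportional slowdown.

    The proportional slowdown equals the number of tasks in the system.
    The tolerance is an assumption of this sketch.
    """
    proportional = len(ratios)
    return all(r <= proportional * (1 + tolerance) for r in ratios.values())

# Hypothetical latencies in milliseconds for a 1-reader 1-writer run.
alone = {"reader": 0.2, "writer": 0.4}
unfair_run = {"reader": 2.0, "writer": 0.44}   # reads blocked behind writes
fair_run = {"reader": 0.4, "writer": 0.8}      # both slowed about 2x

unfair = {t: slowdown_ratio(unfair_run[t], alone[t]) for t in alone}
fair = {t: slowdown_ratio(fair_run[t], alone[t]) for t in alone}
```

In the first hypothetical run the reader is slowed far beyond the two-task proportional slowdown while the writer is almost unaffected, which is the pattern described above for the raw device I/O.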


On the Vertex SSD, most schedulers achieve good fairness 
for the read/write benchmark cases due to its modest 
read/write interference. However, the quanta scheduler 
still exhibits high cost of excessive I/O anticipation. 

Figure 6: Fairness and performance of two-reader (at different read sizes) benchmark under different I/O schedulers. 

Figure 7: Overall system efficiency of synthetic I/O benchmarks under different I/O schedulers. We use the metric of 
concurrent efficiency defined in Equation 2. Results cover four benchmark cases and three SSDs. 


Figure 6 shows the fairness and performance of the 
4 KB-reader and 128 KB-reader benchmark under differ- 
ent I/O schedulers. Results show that only FIOS and 
quanta schedulers can maintain fairness in this case. The 
benefit manifests on all three drives including the Vertex 
SSD. 


Efficiency We next evaluate the overall system effi- 
ciency. Figure 7 illustrates the concurrent efficiency (de- 
fined in Equation 2) under different I/O schedulers. Re- 
sults show FIOS achieves higher efficiency when devices 
allow substantial internal parallelism. These particularly 
include the two cases with four readers on the Intel and 
Vertex SSDs. The quanta scheduler exhibits the worst ef- 
ficiency. This is because its aggressive fairness measures 
lead to substantial efficiency loss. 

Figure 8: Evaluation on the effect of fairness-oriented 
I/O anticipation in FIOS on the Intel SSD. 


I/O Anticipation for Fairness Figure 8 individually 
evaluates the effect of fairness-oriented I/O anticipation 
in FIOS. We compare with two alternatives—no antic- 
ipation and anticipation for I/O proximity (as designed 
in [17] and implemented in Linux). We use the 4-reader 
4-writer with thinktime to demonstrate the effect of I/O 
anticipation. When there is no anticipation, reads suf- 
fer substantial additional latency because the deceptive 
idleness sometimes breaks read preference. While some 
degree of I/O anticipation is necessary, the conventional 
I/O anticipation for I/O proximity leads to high perfor- 
mance cost due to excessive idling. The I/O anticipation 
in FIOS achieves fairness at modest performance cost. 


Summary of Results FIOS exhibits better fairness 
than all alternative schedulers. In terms of efficiency, it 
is competitive with the best of alternative schedulers in 
all cases. It is particularly efficient on the Intel SSD be- 
cause it can exploit its parallelism while managing the 
read-blocked-by-write problem at the same time. 


Among the alternative schedulers, the quanta sched- 
uler is most fair but very inefficient in many cases due 
to the high cost of its aggressive I/O anticipation. The 
raw device I/O is most efficient but it is unfair in many 
situations, particularly in penalizing the reads. 


FIOS is not only effective for maintaining fairness be- 
tween reads and synchronous writes, it is also benefi- 
cial for regulating read tasks with different I/O costs. 
This demonstrates the value of FIOS to support read- 
only workloads and workloads in which writes are asyn- 
chronous and delayed to the background. Further, this 
makes FIOS valuable for the Vertex drive even though 
its read/write performance discrepancy is small. 

Figure 9: Fairness and performance of SPECweb running 
with TPC-C under different I/O schedulers. The 
slowdown ratio for an application is the average request 
response time normalized to that when the application 
runs alone. Results cover two Flash-based SSDs. 


6.2 Evaluation with SPECweb and TPC-C 


Beyond the synthetic benchmarks, we also perform 
evaluation with realistic workloads. We run the read- 
only SPECweb99 workload (running on an Apache 2.2.3 
web server) along with the write-intensive TPC-C (run- 
ning on a MySQL 5.5.13 database). Each application 
is driven by a closed-loop load generator that contains 
four concurrent clients, each of which issues requests 
continuously (issuing a new request right after the out- 
standing one receives a response). The load generators 
run on a different machine and send requests through 
the network. This evaluation employs the two drives (In- 
tel/Mtron SSDs) that exhibit large read/write interference 
effects. 

Figure 9 illustrates the fairness and performance re- 
sults under different I/O schedulers. Unsurprisingly, the 
read-only SPECweb tends to experience more slowdown 
than the write-intensive TPC-C does on Flash storage. 
Among all scheduling approaches, the quanta scheduler 
exhibits the worst performance and fairness. This is due 
to its excessive I/O anticipation. Realistic application 
workloads (like SPECweb and TPC-C) perform significant 
computation and networking between storage I/O 
that appears as inter-I/O thinktime. Idling the storage 
device through such thinktime (as in the quanta scheduler) 
leads to excessive waste. On the other hand, the poor 
fairness of the raw device I/O, Linux CFQ, and SFQ(D) 
is due to a lack of I/O anticipation and poor management 
of read/write interference on Flash. 

Figure 10: Overall system efficiency of SPECweb running 
with TPC-C under different I/O schedulers. We use 
the metric of concurrent efficiency defined in Equation 2. 
Results cover two Flash-based SSDs. 

FIOS exhibits better fairness and performance than all 
the alternative approaches, and its performance is more 
stable across the two SSDs. We measure the fairness as 
the worst-case application slowdown in a concurrent ex- 
ecution (SPECweb slowdown in all cases). Compared to 
the quanta scheduler, FIOS reduces the worst-case 
slowdown by a factor of nine or more on both SSDs. 
Compared to the raw device I/O, FIOS reduces the worst-case 
slowdown by a factor of 2.3 on the Mtron SSD. 
Compared to the Linux CFQ, FIOS reduces the worst-case 
slowdown by a factor of five on the Mtron SSD. 
Compared to SFQ(D), FIOS reduces the worst-case slowdown 
by a factor of about 3.1 on the Intel SSD. 

Figure 10 shows the overall system efficiency of 
SPECweb running with TPC-C under different I/O 
schedulers. Results show that FIOS improves the effi- 
ciency above the best alternative scheduler by 14% and 
18% on the Intel and Mtron SSDs respectively. FIOS 
achieves high efficiency due to its proper management of 
read/write interference, I/O parallelism, and controlled 
I/O anticipation. 


6.3 Evaluation on Low-Power CompactFlash 


We also test FIOS on a low-power wimpy node like 
the ones used in the FAWN work [2]. Specifically, the 
node contains an Alix board with a single-core 500 MHz 
AMD Geode CPU, 256MB SDRAM memory, and a 
16GB SanDisk CompactFlash drive. The full node 
consumes about 5.9 Watts of power at peak load. The 
CompactFlash, while also NAND Flash-based, is significantly 
less sophisticated than solid state drives. CompactFlash 
cards lack the sophisticated firmware and degree of 
parallelism available in solid-state drives. Despite these 
differences, CompactFlash still exhibits some of the 
intrinsic Flash characteristics that FIOS is designed to consider 
and exploit. 

[Figure 11 plot: task slowdown ratio (y-axis) of FAWNDS hash gets and FAWNDS hash puts under each I/O scheduler, with the proportional slowdown marked.] 

Figure 11: Performance of concurrent FAWN Data Store 
hash gets (data reads) and hash puts (data writes) on a 
low-power CompactFlash. The slowdown ratio for a task 
is defined as its running-alone throughput divided by its 
throughput at the concurrent run. Higher slowdown ratio 
means worse performance. 

We requested and acquired the FAWN Data Store ap- 
plication from the authors [2]. In our experiments, we 
concurrently run two FAWN Data Store tasks, one per- 
forming hash gets (data reads) and the other performing 
hash puts (data writes). We run hash puts synchronously 
to ensure that the data is made persistent before its 
result is externalized to the client. These tasks run against 
data stores of 1 million records. 


Figure 11 presents the resulting get/put slowdown ra- 
tios under different I/O schedulers. Only FIOS keeps 
both hash gets and puts below the proportional slow- 
down. The quanta scheduler also exhibits good fair- 
ness because its suppression of parallelism has no harm- 
ful effect on the CompactFlash which does not allow 
any I/O parallelism. Further, the quanta scheduler’s ex- 
cessive I/O anticipation causes little efficiency loss for 
FAWN Data Store that performs batched I/O with almost 
no inter-I/O thinktime. Under all other approaches (raw 
device I/O, Linux CFQ, and SFQ(D)), hash gets expe- 
rience worse performance degradation than the propor- 
tional slowdown, which indicates poor fairness. 


7 Conclusion 


Flash-based storage devices are capable of alleviating 
I/O bottlenecks in data-intensive applications. However, 
the unique performance characteristics of Flash storage 
must be taken into account in order to fully exploit their 
superior I/O capabilities while offering fair access to ap- 
plications. In this paper, we have characterized the per- 
formance of several Flash-based storage devices. We 
observed that during concurrent access, writes can dra- 
matically affect the response time of read requests. We 
also observed that Flash-based storage exhibits support 
for some degree of parallel I/O, though the benefit of 
parallel I/O varies across devices. Further, the lack of 
seek/rotation overhead eliminates the performance bene- 
fit of anticipatory I/O, but proper I/O anticipation is still 
needed for the purpose of fairness. 


Based on these motivations, we designed a new Flash 
I/O scheduling approach that contains four essential 
techniques to ensure fairness with high efficiency—fair 
timeslice management that allows timeslice fragmenta- 
tion and concurrent request issuance, read/write interfer- 
ence management, I/O parallelism, and I/O anticipation 
for fairness. We implemented these design principles in 
a new I/O scheduler for Linux. 


We evaluated our I/O scheduler alongside three alter- 
native fairness-oriented I/O schedulers (Linux CFQ [3], 
SFQ(D) [18], and a quanta-based I/O scheduler similar 
to that in Argon [36]). Our evaluation uses a variety 
of synthetic benchmarks and realistic application work- 
loads on several Flash-based storage devices (including a 
CompactFlash card in a low-power wimpy node). The re- 
sults expose the shortcomings of existing I/O schedulers 
while validating our design principles for Flash resource 
management. In conclusion, this paper makes the case 
that fairness warrants first-class concern in Flash I/O 
scheduling and that it is possible to achieve fairness while 
attaining high efficiency. 

While FIOS is primarily motivated by the Flash 
read/write interference, we also demonstrate that FIOS is 
beneficial for regulating the resource usage fairness be- 
tween read tasks with different I/O costs (a task perform- 
ing small reads runs concurrently with a task performing 
large reads). This illustrates the value of FIOS to support 
read-only workloads and workloads in which writes are 
asynchronous and delayed to the background. Further, 
FIOS is also valuable for Flash drives that have modest 
read/write performance discrepancy. 


Abstract 


Redundancy elimination using data deduplication and 
incremental data processing has emerged as an important 
technique to minimize storage and computation require- 
ments in data center computing. In this paper, we present 
the design, implementation and evaluation of Shredder, 
a high performance content-based chunking framework 
for supporting incremental storage and computation sys- 
tems. Shredder exploits the massively parallel process- 
ing power of GPUs to overcome the CPU bottlenecks of 
content-based chunking in a cost-effective manner. 
Unlike previous uses of GPUs, which have focused on 
applications where computation costs are dominant, 
Shredder is designed to operate in both compute- and 
data-intensive environments. To allow this, Shredder provides 
several novel optimizations aimed at reducing the cost 
of transferring data between host (CPU) and GPU, fully 
utilizing the multicore architecture at the host, and re- 
ducing GPU memory access latencies. With our opti- 
mizations, Shredder achieves a speedup of over 5X for 
chunking bandwidth compared to our optimized parallel 
implementation without a GPU on the same host system. 
Furthermore, we present two real world applications of 
Shredder: an extension to HDFS, which serves as a basis 
for incremental MapReduce computations, and an incre- 
mental cloud backup system. In both contexts, Shred- 
der detects redundancies in the input data across succes- 
sive runs, leading to significant savings in storage, com- 
putation, and end-to-end completion times. 


1 Introduction 


With the growth in popularity of Internet services, on- 
line data stored in data centers is increasing at an ever- 
growing pace. In 2010 alone, mankind is estimated to 
have produced 1,200 exabytes of data [1]. As a result 
of this “data deluge,” managing storage and computation 
over this data has become one of the most challenging 
tasks in data center computing. 

A key observation that allows us to address this chal- 
lenge is that a large fraction of the data that is produced 


and the computations performed over this data are redun- 
dant; hence, not storing redundant data or performing re- 
dundant computation can lead to significant savings in 
terms of both storage and computational resources. To 
make use of redundancy elimination, there exist a se- 
ries of research and product proposals (detailed in §8) 
for performing data deduplication and incremental com- 
putations, which avoid storing or computing tasks based 
on redundant data, respectively. 


Both data deduplication schemes and incremental 
computations rely on storage systems to detect duplicate 
content. In particular, the most effective way to perform 
this detection is using content-based chunking, a tech- 
nique that was pioneered in the context of the LBFS [33] 
file system, where chunk boundaries within a file are dic- 
tated by the presence of certain content instead of a fixed 
offset. Even though content-based chunking is useful, it 
is a computationally demanding task. Chunking meth- 
ods need to scan the entire file contents, computing a fin- 
gerprint over a sliding window of the data. This high 
computational cost has caused some systems to simplify 
the fingerprinting scheme by employing sampling tech- 
niques, which can lead to missed opportunities for elim- 
inating redundancies [9]. In other cases, systems skip 
content-based chunking entirely, thus forgoing the op- 
portunity to reuse identical content in similar, but not 
identical files [22]. Therefore, as we get flooded with in- 
creasing amounts of data, addressing this computational 
bottleneck becomes a pressing issue in the design of stor- 
age systems for data center-scale systems. 


To address this issue we propose Shredder, a sys- 
tem for performing efficient content-based chunking to 
support scalable incremental storage and computations. 
Shredder builds on the observation that neither the exclu- 
sive use of multicore CPUs nor the specialized hardware 
accelerators is sufficient to deal with large-scale data in 
a cost-effective manner: multicore CPUs alone cannot 
sustain a high throughput, whereas the specialized 
hardware accelerators lack programmability for other tasks 
and are costly. As an alternative, we explore employing 
modern GPUs to meet these high computational 
requirements (while, as evidenced by prior research [23, 26], 
also allowing for a low operational cost). The application 
of GPUs in this setting, however, raises a significant 
challenge: while GPUs have been shown to produce 
performance improvements for computation-intensive 
applications, where CPU dominates the overall cost 
envelope [23, 24, 26, 43, 44], it still remains to be proven that 
GPUs are equally effective for data-intensive applications, 
which need to perform large data transfers for a 
significantly smaller amount of processing. 

To make the use of GPUs effective in the context of 
storage systems, we designed several novel techniques, 
which we apply to two proof-of-concept applications. In 
particular, Shredder makes the following technical con- 
tributions: 


GPU acceleration framework. We identified three key 
challenges in using GPUs for data intensive applications, 
and addressed them with the following techniques: 


• Asynchronous execution. To minimize the cost of 
transferring data between host (CPU) and GPU, we 
use a double buffering scheme. This enables GPUs 
to perform computations while data is simultaneously 
transferred in the background. To support this 
background data transfer, we also introduce a ring 
buffer of pinned memory regions. 


• Streaming pipeline. To fully utilize the availability 
of a multicore architecture at the host, we use 
a pipelined execution for the different stages of 
content-based chunking. 


• Memory coalescing. Finally, because of the high 
degree of parallelism, memory latencies in the GPU 
will be high due to the presence of random access 
across multiple bank rows of GPU memory, 
which leads to a higher number of conflicts. We 
address this problem with a cooperative memory access 
scheme, which reduces the number of fetch requests 
and bank conflicts. 


Use cases. We present two applications of Shredder to 
accelerate storage systems. The first case study is a 
system called Inc-HDFS, a file system that is based on 
HDFS but is designed to support incremental computations 
for MapReduce jobs. Inc-HDFS leverages Shredder 
to provide a mechanism for identifying similarities 
in the input data of consecutive runs of the same MapRe- 
duce job. In this way Inc-HDFS enables efficient incre- 
mental computation, where only the tasks whose inputs 
have changed need to be recomputed. The second case 
study is a backup architecture for a cloud environment, 
where VMs are periodically backed up. We use Shredder 
on a backup server and use content-based chunking 
to perform efficient deduplication and significantly 
improve backup bandwidth. 

We present experimental results that establish the ef- 
fectiveness of the individual techniques we propose, as 
well as the ability of Shredder to improve the perfor- 
mance of the two real-world storage systems. 

The rest of the paper is organized as follows. In Sec- 
tion 2, we provide background on content-based chunk- 
ing, and discuss specific architectural features of GPUs. 
An overview of the GPU acceleration framework and its 
scalability challenges are covered in Section 3. Section 4 
presents a detailed system design, namely several 
performance optimizations for increasing Shredder’s 
throughput. We present the implementation and evalua- 
tion of Shredder in Section 5. We cover the two case 
studies in Section 6 and Section 7. Finally, we discuss 
related work in Section 8, and conclude in Section 9. 


2 Background 


In this section, we first present background on content- 
based chunking, to explain its cost and potential for par- 
allelization. We then provide a brief overview of the mas- 
sively parallel compute architecture of GPUs, namely 
their memory subsystem and its limitations. 


2.1 Content-based Chunking 


Identification of duplicate data blocks has been used for 
deduplication systems in the context of both storage [33, 
39] and incremental computation frameworks [14]. For 
storage systems, the duplicate data blocks need not be 
stored and, in the case of incremental computations, a 
sub-computation based on the duplicate content may be 
reused. Duplicate identification essentially consists of: 


1. Chunking: This is the process of dividing the data 
set into chunks in a way that aids in the detection of 
duplicate data. 

2. Hashing: This is the process of computing a 
collision-resistant hash of the chunk. 

3. Matching: This is the process of checking if the 
hash for a chunk already exists in the index. If it ex- 
ists then there is a duplicate chunk, else the chunk 
is new and its hash is added to the index. 
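
The hashing and matching steps can be sketched in a few lines. This is a minimal illustration that assumes the chunks have already been produced by step 1, uses SHA-256 as the collision-resistant hash, and keeps the index in an in-memory set; the function name and these choices are assumptions of the sketch rather than details of any particular system.

```python
import hashlib

def deduplicate(chunks, index=None):
    """Return the chunks not seen before and the number of duplicates found.

    chunks: iterable of bytes objects (the output of the chunking step).
    index:  set of digests of chunks that are already stored.
    """
    index = set() if index is None else index
    new_chunks, duplicates = [], 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()   # step 2: hashing
        if digest in index:                       # step 3: matching
            duplicates += 1
        else:
            index.add(digest)
            new_chunks.append(chunk)
    return new_chunks, duplicates
```

Passing the same index to successive calls models successive runs over similar data: only chunks whose hashes are absent from the index need to be stored or recomputed.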


This paper focuses on the design of chunking schemes 
(step 1), since this can be, in practice, one of the main 
bottlenecks of a system that tries to perform this class 
of optimizations [9, 22]. Thus we begin by giving some 
background on how chunking is performed. 

One of the most popular approaches for content-based 
chunking is to compute a Rabin fingerprint [40] over slid- 
ing windows of w contiguous bytes. The hash values 
produced by the fingerprinting scheme are used to create 
chunk boundaries by starting new chunks whenever the 
computed hash matches one of a set of markers (e.g., its 
value mod p is lower than or equal to a constant). In more 
detail, given a w-bit sequence, it is represented as a 
polynomial of degree w - 1 over the finite field GF(2): 


f(x) = m_0 + m_1 x + \cdots + m_{w-1} x^{w-1}    (1) 


Given this polynomial, an irreducible polynomial div(x) 
of degree k is chosen. The fingerprint of the original bit 
sequence is the remainder r(x) obtained by dividing 
f(x) by div(x). A chunk boundary is defined when the 
fingerprint takes one of a set of pre-defined values called 
markers. In addition, practical schemes define a mini- 
mum min and maximum max chunk size, which implies 
that after finding a marker the fingerprint computation 
can skip min bytes, and that a marker is always set when 
a total of max bytes (including the skipped portion) have 
been scanned without finding a marker. The minimum 
size limits the metadata overhead for index management 
and the maximum size limits the size of the RAM buffers 
that are required. Throughout the rest of the paper, we 
will use min = 0 and max = ∞ unless otherwise noted. 
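
The fingerprint definition above can be illustrated with a minimal, unoptimized sketch. It assumes that polynomials over GF(2) are encoded as Python integers (bit i holds the coefficient of x^i) and it uses x^64 + x^4 + x^3 + x + 1 as div(x), a degree-64 polynomial that we take to be irreducible. The function names, the default mask, and the zero marker are illustrative choices. Practical implementations update the fingerprint incrementally as the window slides instead of recomputing the remainder for every window.

```python
# Assumed divisor: x^64 + x^4 + x^3 + x + 1 (taken to be irreducible).
DIV = (1 << 64) | 0b11011

def poly_mod(f, div=DIV):
    """Remainder of f(x) divided by div(x) over GF(2)."""
    deg = div.bit_length() - 1
    while f.bit_length() - 1 >= deg:
        # Cancel the leading term of f; subtraction in GF(2) is XOR.
        f ^= div << (f.bit_length() - 1 - deg)
    return f

def chunk_boundaries(data, window=48, mask=0x1FFF, marker=0):
    """Offsets at which a chunk ends, with min = 0 and max = infinity.

    A boundary is declared after every window whose fingerprint has its
    masked low-order bits equal to the marker.
    """
    boundaries = []
    for end in range(window, len(data) + 1):
        f = int.from_bytes(data[end - window:end], "big")
        if (poly_mod(f) & mask) == marker:
            boundaries.append(end)
    return boundaries
```

Because each decision depends only on the bytes inside the window, inserting data at the front of a file shifts the later boundaries by the inserted length without changing which content they follow. This is the property that makes content-based chunks robust to insertions.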

Rabin fingerprinting is computationally very expen- 
sive. To minimize the computation cost, there has been 
work on reducing chunking time by using sampling tech- 
niques, where only a subset of bytes are used for chunk 
identification (e.g., SampleByte [9]). However, such ap- 
proaches are limiting because they are suited only for 
small sized chunks, as skipping a large number of bytes 
leads to missed opportunities for deduplication. Thus, 
Rabin fingerprinting still remains one of the most pop- 
ular chunking schemes, and reducing its computational 
cost presents a fundamental challenge for improving sys- 
tems that make use of duplicate identification. 

When minimum and maximum chunk sizes are not re- 
quired, chunking can be parallelized in a way that differ- 
ent threads operate on different parts of the data com- 
pletely independent of each other, with the exception 
of a small overlap of the size of the sliding window (w 
bytes) near partition boundaries. Using min and max 
chunk sizes complicates this task, though schemes exist 
to achieve efficient parallelization in this setting [29, 31]. 
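
The independence of the partitions can be checked with a small sketch. To stay self-contained it substitutes a CRC-32 of the window for the Rabin fingerprint and uses a hypothetical 8-bit marker test; Python threads are used only to show that the partitions share no state, not to obtain a speedup.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

W = 48        # sliding-window size in bytes
MASK = 0xFF   # hypothetical marker test: low-order 8 bits equal to zero

def scan(data, first_end, last_end):
    """Boundaries among the windows that end at offsets first_end..last_end."""
    return [end for end in range(max(first_end, W), last_end + 1)
            if (zlib.crc32(data[end - W:end]) & MASK) == 0]

def parallel_boundaries(data, parts=4):
    """Split the data into `parts` partitions and scan them independently.

    The worker for the partition starting at byte s checks windows ending
    at s + 1 and later, so it reads up to W - 1 bytes that belong to the
    previous partition. This is the small overlap near partition boundaries.
    """
    step = max(1, -(-len(data) // parts))   # ceiling division
    ranges = [(s + 1, min(s + step, len(data)))
              for s in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = pool.map(lambda r: scan(data, *r), ranges)
        return [b for part in results for b in part]
```

The partitioned scan returns the same boundaries as a single sequential pass, `scan(data, 1, len(data))`, for any number of partitions.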


2.2 General-Purpose Computing on GPUs 


GPU architecture. GPUs are highly parallel, multi- 
threaded, many-core processors with tremendous com- 
putational power and very high memory bandwidth. The 
high computational power is derived from the special- 
ized design of GPUs, where more transistors are de- 
voted to simple data processing units (ALUs) rather 
than used to integrate sophisticated pre-fetchers, control 
flows and data caches. Hence, GPUs are well-suited for 
data-parallel computations with high arithmetic intensity 
rather than data caching and flow control. 

Figure 1 illustrates a simplified architecture of a GPU. 
A GPU can be modeled as a set of Streaming Multiprocessors 
(SMs), each consisting of a set of scalar processor 
cores (SPs). An SM works as SIMT (Single Instruction, 
Multiple Threads), where the SPs of a multiprocessor 
execute the same instruction simultaneously but on 
different data elements. The data memory in the GPU 
is organized as multiple hierarchical spaces for threads 
in execution. The GPU has a large high-bandwidth 
device memory with high latency. Each SM also contains 
a very fast, low latency on-chip shared memory to be 
shared among its SPs. Also, each thread has access to a 
private local memory. 

[Figure 1 diagram labels: GPU (Device); Multiprocessor 1, Multiprocessor 2, ..., Multiprocessor N; Shared Memory; PCI.] 

Figure 1: A simplified view of the GPU architecture. 

Overall, a GPU architecture differs from a traditional 
processor architecture in the following ways: (i) an order 
of magnitude higher number of arithmetic units; (ii) 
minimal support for prefetching and buffers for outstand- 
ing instructions; (iii) high memory access latencies and 
higher memory bandwidth. 


Programming model. The CUDA [6] programming 
model is amongst the most popular programming mod- 
els to extract parallelism and scale applications on GPUs. 
In this programming model, a host program runs on the 
CPU and launches a kernel program to be executed on 
the GPU device in parallel. The kernel executes as a grid 
of one or more thread blocks, each of which is dynam- 
ically scheduled to be executed on a single SM. Each 
thread block consists of a group of threads that cooper- 
ate with each other by synchronizing their execution and 
sharing multiprocessor resources such as shared memory 
and registers. Threads within a thread block get executed 
on a multiprocessor in scheduling units of 32 threads, 
called a warp. A half-warp is either the first or second 
half of a warp. 
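
The relationship between threads, warps, and half-warps reduces to simple index arithmetic, sketched below. The sketch assumes a one-dimensional thread block whose threads are numbered consecutively from zero; the function names are hypothetical.

```python
WARP_SIZE = 32

def warps_per_block(threads_per_block):
    """Number of warps needed to schedule a block (the last may be partial)."""
    return -(-threads_per_block // WARP_SIZE)   # ceiling division

def locate(thread_index):
    """Return (warp, lane, half_warp) for a thread index within its block."""
    warp = thread_index // WARP_SIZE
    lane = thread_index % WARP_SIZE
    half_warp = lane // (WARP_SIZE // 2)   # 0 = first half, 1 = second half
    return warp, lane, half_warp
```

For example, a block of 100 threads occupies four warps, the last of which is only partially filled.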


2.3 SDRAM Access Model 


Offloading chunking to the GPU requires a large amount 
of data to be transferred from the host to the GPU mem- 
ory. Thus, we need to understand the performance of 
the memory subsystem in the GPU, since it is critical to 
chunking performance. 

The global memory in the Nvidia C2050 GPU is 
GDDR5, which is based on the DDR3 memory architecture 
[2]. Memory is arranged into banks and banks 
are organized into rows. Every bank also has a sense 
amplifier, into which a row must be loaded before any 
data from the row can be read by the GPU. Whenever 
a memory location is accessed, an ACT command se- 
lects the corresponding bank and brings the row contain- 
ing the memory location into a sense amplifier. The ap- 
propriate word is then transferred from the sense ampli- 
fier. When an access to a second memory location is 
performed within the same row, the data is transferred 
directly from the sense amplifier. On the other hand, if 
the data is accessed from a different row in the bank, 
a PRE (pre-charge) command writes the previous data 
back from the sense amplifier to the memory row. A sec- 
ond ACT command is performed to bring the row into 
the sense amplifier. 

Note that both ACT and PRE commands are high la- 
tency operations that contribute significantly to overall 
memory latency. If multiple threads access data from 
different rows of the same bank in parallel, that sense 
amplifier is continually activated (ACT) and pre-charged 
(PRE) with different rows, leading to a phenomenon 
called bank conflict. In particular, a high degree of un- 
coordinated parallel access to the memory subsystem is 
likely to result in a large number of bank conflicts. 
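
The effect of the access pattern on the number of ACT and PRE commands can be illustrated with a toy model. The sketch assumes one sense amplifier per bank that holds at most one open row and processes accesses one at a time; it ignores timing and all other details of a real memory controller.

```python
def count_commands(accesses, num_banks):
    """Count ACT and PRE commands for a sequence of (bank, row) accesses."""
    open_row = [None] * num_banks   # row currently held by each sense amplifier
    act = pre = 0
    for bank, row in accesses:
        if open_row[bank] == row:
            continue                # row already open: served directly
        if open_row[bank] is not None:
            pre += 1                # write the previous row back
        act += 1                    # bring the requested row in
        open_row[bank] = row
    return act, pre

# Eight accesses to the same row of bank 0: one ACT, no PRE.
same_row = count_commands([(0, 5)] * 8, num_banks=8)
# Eight accesses alternating between two rows of bank 0: a bank conflict
# on every access after the first.
alternating = count_commands([(0, i % 2) for i in range(8)], num_banks=8)
```

In this model the alternating pattern issues eight ACT and seven PRE commands where the same-row pattern issues a single ACT, which is why uncoordinated parallel access is costly.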


3 System Overview and Challenges 


In this section, we first present the basic design of Shred- 
der. Next, we explain the main challenges in scaling up 
our basic design. 


3.1 Basic GPU-Accelerated Framework 


Figure 2 depicts the workflow of the basic design for the 
Shredder chunking service. In this initial design, a 
multi-threaded program running in user mode on the host (i.e., 
on the CPU) drives the GPU-based computations. The 
framework is composed of four major modules. First, the 
Reader thread on the host receives the data stream (e.g., 
from a SAN), and places it in the memory of the host for 
content-based chunking. After that, the Transfer thread 
allocates global memory on the GPU and uses the DMA 
controller to transfer input data from the host memory 
to the allocated GPU (device) memory. Once the data 
transfer from the CPU to the GPU is complete, the host 
launches the Chunking kernel for parallel sliding win- 
dow computations on the GPU. Once the chunking ker- 
nel finds all resulting chunk boundaries for the input data, 
the Store thread transfers the resulting chunk boundaries 
from the device memory to the host memory. When min- 
imum and maximum chunk sizes are set, the Store thread 
also adjusts the chunk set accordingly. Thereafter, the 
Store thread uses an upcall to notify the application that 
is using the Shredder library of the chunk boundaries. 


Figure 2: Basic workflow of Shredder. 


The chunking kernel is responsible for performing par- 
allel content-based chunking of the data present in the 
global memory of the GPU. Accesses to the data are per- 
formed by multiple threads that are created on the GPU 
by launching the chunking kernel. The data in the GPU 
memory is divided into equal sized sub-streams, as many 
as the number of threads. Each thread is responsible for 
handling one of these sub-streams. For each sub-stream, 
a thread computes a Rabin fingerprint in a sliding win- 
dow manner. In particular, each thread examines a 48- 
byte region from its assigned sub-stream, and computes 
the Rabin fingerprint for the selected region. The thread 
compares the low-order 13 bits of the region’s
fingerprint with a pre-defined marker. This leads to an
expected chunk size of 4 KB. If these bits match
the marker, the thread marks the end of that region
as a chunk boundary. The thread continues to
compute the Rabin fingerprint in a sliding window man- 
ner in search of new chunk boundaries by shifting a byte 
forward in the sub-stream, and repeating this process. 
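The sliding-window search performed by one thread on its sub-stream can be sketched sequentially in Python. This is a minimal illustration, not the Shredder kernel: it swaps in a simple polynomial rolling hash for a true Rabin fingerprint (which uses polynomial arithmetic over GF(2)), and the values of `MARKER`, `BASE`, and `MOD` are hypothetical. Only the 48-byte window and the 13-bit comparison come from the text.

```python
WINDOW = 48            # bytes per sliding window (from the text)
MASK_BITS = 13         # low-order bits compared with the marker (from the text)
MARKER = 0x78          # hypothetical marker value
BASE = 257             # hypothetical rolling-hash base
MOD = (1 << 61) - 1    # hypothetical modulus

def find_boundaries(data, mask_bits=MASK_BITS, marker=MARKER):
    """Return the end offsets of all windows whose hash matches the marker."""
    if len(data) < WINDOW:
        return []
    mask = (1 << mask_bits) - 1
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    h = 0
    for b in data[:WINDOW]:           # hash of the first window
        h = (h * BASE + b) % MOD
    boundaries = []
    end = WINDOW                      # h always covers data[end - WINDOW:end]
    while True:
        if (h & mask) == marker:
            boundaries.append(end)
        if end == len(data):
            break
        # Shift the window one byte forward: drop the oldest byte, add the next.
        h = ((h - data[end - WINDOW] * top) * BASE + data[end]) % MOD
        end += 1
    return boundaries
```

The rolling update is what makes the one-byte shift cheap: each step costs a constant amount of work instead of rehashing all 48 bytes.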


3.2 Scalability Challenges 


The basic design for Shredder that we presented in 
the previous section corresponds to the traditional way 
in which GPU-assisted applications are implemented. 
This design has proven to be sufficient for computation- 
intensive applications, where the computation costs can 
dwarf the cost of transferring the data to the GPU mem- 
ory and accessing that memory from the GPU’s cores. 
However, it results in only modest performance gains 
for data intensive applications that perform single-pass 
processing over large amounts of data, with a compu- 
tational cost that is significantly lower than traditional 
GPU-assisted applications. 

To understand why this is the case, we present in Ta- 
ble 1 some key performance characteristics of a specific 
GPU architecture (NVidia Tesla C2050), which helps us 
explain some important bottlenecks for GPU-accelerated 
applications. In particular, and as we will demonstrate in 
subsequent sections, we identified the following bottlenecks
in the basic design of Shredder.


Table 1: Performance characteristics of the GPU (NVidia
Tesla C2050)


GPU device memory bottleneck. The fact that data 
needs to be transferred to the GPU memory before being 
processed by the GPU represents a serial dependency: 
such processing only starts to execute after the corre- 
sponding transfer concludes. 


Host bottleneck. The host machine performs three se- 
rialized steps (performed by the Reader, Transfer, and 
Store threads) in each iteration. Since these three steps 
are inherently dependent on each other for a given input 
buffer, this serial execution becomes a bottleneck at the host.
Also, given the availability of multicore architectures at
the host, this serialized execution leads to an underutilization
of host resources.


High memory latencies and bank conflicts. The global 
device memory on the GPU has a high latency, of the 
order of 400 to 600 cycles. This works well for HPC 
algorithms, which are quadratic, O(N^2), or a higher-degree
polynomial in the input size N, since the computation
time hides the memory access latencies. Chunking
is also compute intensive, but it is only linear in
the input size (O(N), though the constants are high).
Hence, even though the problem is compute intensive 
on traditional CPUs, on a GPU with an order of magni- 
tude larger number of scalar cores, the problem becomes 
memory-intensive. In particular, the less sophisticated 
memory subsystem of the GPU (without prefetching or 
data caching support) is stressed by frequent memory ac- 
cess by a massive number of threads in parallel. Fur- 
thermore, a higher degree of parallelism causes memory 
to be accessed randomly across multiple bank rows, and 
leads to a very high number of bank conflicts. As a re- 
sult, it becomes difficult to hide the latencies of accesses 
to the device memory. 


4 Shredder Optimizations 


In this section, we describe several novel optimizations 
that extend the basic design to overcome the challenges 
we highlighted in the previous section. 

Figure 3: Bandwidth test between host and device.


4.1 Device Memory Bottlenecks 


4.1.1 Concurrent Copy and Execution 


The main challenge we need to overcome is the fact that 
traditional GPU-assisted applications that follow the ba- 
sic design were designed for a scenario where the cost of 
transferring data to the GPU is significantly outweighed 
by the actual computation cost. In particular, the ba- 
sic design serializes the execution of copying data to the 
GPU memory and consuming the data from that memory 
by the Kernel thread. This serialized execution may not 
suit the needs of data intensive applications, where the 
cost of the data transfer step becomes a more significant 
fraction of the overall computation time. 

To understand the magnitude of this problem, we mea- 
sured the overhead of a DMA transfer of data between 
the host and the device memory over the PCIe link connected
to the GPU. Figure 3 summarizes the effective bandwidth
between host memory and device memory for different
buffer sizes. We measured the bandwidth both
ways between the host and the device to gauge the DMA 
overhead for the Transfer and the Store thread. Note 
that the effective bandwidth is a property of the DMA 
controller and the PCIe bus, and it is independent of the
number of threads launched in the GPU. In this experi- 
ment, we also varied the buffer type allocated for the host 
memory region, which is allocated either as pageable or 
pinned memory regions. (The need for pinned memory 
will become apparent shortly.) 


Highlights. Our measurements demonstrate the following:
(i) small sized buffer transfers are more expensive
than those using large sized buffers; (ii) the throughput
saturates for buffer sizes larger than 32 MB (for pageable
memory regions) and 256 KB (for pinned memory
regions); (iii) for large sized buffers (greater than 32 MB),
the throughput difference between pageable and pinned
memory regions is not significant; and (iv) the effective
bandwidth of the PCIe bus for data transfer is on the order
of 5 GB/sec, whereas the global device memory
bandwidth available to the scalar processors in GPUs is
on the order of 144 GB/sec, an order of magnitude higher.


Figure 4: Concurrent copy and execution.


Implications. The time spent to chunk a given buffer is 
split between the memory transfer and the kernel com- 
putation. For a non-optimized implementation of the 
chunking computation, we spend approximately 25% of 
the time performing the transfer. Once we optimize the 
processing in the GPU, the host to GPU memory trans- 
fer may become an even greater burden on the overall 
performance. 


Optimization. In order to avoid the serialized execution 
of the copy and data consumption steps, we propose to 
overlap the copy and the execution phases, thus allowing 
for the concurrent execution of data communication and 
the chunking kernel computations. To enable this, we 
designed a double buffering technique as shown in Fig- 
ure 4, where we partition the device memory into twin 
buffers. These twin buffers will be alternately used
for communication and computation. In this scheme, the 
host asynchronously copies the data into the first buffer 
and, in the background, the device works on the previ- 
ously filled second buffer. To be able to support asyn- 
chronous communication, the host buffer is allocated as 
a pinned memory region, which prevents the region from 
being swapped out by the pager. 
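The twin-buffer hand-off can be illustrated with ordinary Python threads. This is only a host-side analogy under stated assumptions: a background thread stands in for the asynchronous DMA copy, a `bytearray` stands in for a device buffer, and the names `run_double_buffered` and `process` are hypothetical. It shows the alternation of the two buffers, not real GPU concurrency.

```python
import queue
import threading

def run_double_buffered(blocks, process):
    """Copy each block into one of two buffers while the other is processed."""
    free = queue.Queue()      # buffers ready to receive the next copy
    filled = queue.Queue()    # buffers ready to be processed
    for _ in range(2):        # the twin buffers
        free.put(bytearray())
    results = []

    def copier():
        for block in blocks:
            buf = free.get()  # wait until one of the twin buffers is free
            buf[:] = block    # stands in for the asynchronous copy
            filled.put(buf)
        filled.put(None)      # signal the end of the stream

    worker = threading.Thread(target=copier)
    worker.start()
    while True:
        buf = filled.get()
        if buf is None:
            break
        results.append(process(bytes(buf)))  # stands in for the kernel
        free.put(buf)         # hand the buffer back for the next copy
    worker.join()
    return results
```

Because only two buffers circulate, the copier can run at most one block ahead of the consumer, which mirrors the scheme in which one buffer is being filled while the previously filled one is being consumed.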


Effectiveness. Figure 5 shows the effectiveness of 
the double buffering approach, where the histogram for 
transfer and kernel execution shows a 30% time over- 
lap between the concurrent copy and computation. Even 
though the total time taken for concurrent copy and ex- 
ecution (Concurrent) is reduced by only 15% as com- 
pared to the serialized execution (Serialized), it is im- 
portant to note that the total time is now dictated solely 
by the compute time. Hence, double buffering is able to 
remove the data copying time from the critical path, al- 
lowing us to focus only on optimizing the computation 
time in the GPU (which we address in § 4.3). 
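A back-of-the-envelope model helps show why overlapping removes the copy from the critical path. The sketch below is an idealization that assumes a fixed copy time and kernel time per buffer and ignores launch overheads and bus contention; the function names and the numbers in the usage note are hypothetical and are not the measurements reported in Figure 5.

```python
def serialized_time(num_buffers, copy_time, kernel_time):
    """Total time when each copy must finish before its kernel starts,
    and the next copy waits for that kernel."""
    return num_buffers * (copy_time + kernel_time)

def double_buffered_time(num_buffers, copy_time, kernel_time):
    """Total time with two buffers: the first copy and the last kernel
    cannot overlap with anything; in between, the slower step dominates."""
    if num_buffers == 0:
        return 0
    slower = max(copy_time, kernel_time)
    return copy_time + (num_buffers - 1) * slower + kernel_time
```

With 4 buffers, a copy cost of 1 unit, and a kernel cost of 3 units, the serialized schedule takes 16 units while the overlapped one takes 13, that is, one initial copy plus four kernel executions.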

Supporting concurrent copy and execution, however,
requires us to pin memory at the host, which reduces
the memory allocation performance at the host.
We next present an optimization to handle this side effect
and ensure that double buffering leads to an end-to-end
increase in chunking bandwidth.


Figure 5: Normalized overlap time of communication
with computation with varied buffer sizes for 1GB data.


4.1.2 Circular Ring Pinned Memory Buffers 


As explained above, the double buffering requires an 
asynchronous copy between host memory and device 
memory. To support this asynchronous data transfer, the 
host side buffer should be allocated as a pinned memory
region. This locks the corresponding pages so that
accessing that region does not result in a page fault until 
the region is subsequently unpinned. 

To quantify the allocation overheads of using a pinned 
memory region, we compared the time required for dy- 
namic memory allocation (using malloc) and pinned 
memory allocation (using the CUDA memory allocator 
wrapper). Since Linux follows an optimistic memory al- 
location strategy, where the actual allocation is deferred 
until memory initialization, in our measurements we ini- 
tialized the memory region (using bzero) to force the 
kernel to allocate the desired buffer size. Figure 6 com- 
pares the allocation overhead of pageable and pinned 
memory for different buffer sizes. 


Highlights. The important takeaways are the following:
(i) pinned memory allocation is more expensive
than the normal dynamic memory allocation; and (ii) an
adverse side effect of having too many pinned memory 
pages is that it can increase paging activity for unpinned 
pages, which degrades performance. 


Implications. The main implication for our system de- 
sign is that we need to minimize the allocation of pinned 
memory region buffers, to avoid increased paging activ- 
ity or even thrashing. 


Optimization. To minimize the allocation of pinned
memory regions while restricting ourselves to using the
CUDA architecture, we designed a circular ring buffer
built from a pinned memory region, as shown in Figure 7,
with the property that the number of buffers can
be kept low (namely as low as the number of stages in
the streaming pipeline, as described in §4.2). The pinned
regions in the circular buffer are allocated only once during
the system initialization, and thereafter are reused in
a round-robin fashion after the transfer between the host
and the device memory is complete. This allows us to
keep the overhead of costly memory allocation negligible
and have sufficient memory pages for other tasks.


Figure 6: Comparison of allocation overhead of pageable
with pinned memory region.
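The allocate-once, reuse-round-robin idea can be sketched as a small Python class. This is an illustration under assumptions: `bytearray` objects stand in for pinned host regions (real pinned memory would come from the CUDA allocator), and the class name `PinnedRing` and its methods are hypothetical.

```python
class PinnedRing:
    """A fixed set of buffers, allocated once and handed out in ring order."""

    def __init__(self, num_buffers, buffer_size):
        # One-time allocation at initialization; nothing is allocated later.
        self._buffers = [bytearray(buffer_size) for _ in range(num_buffers)]
        self._in_flight = [False] * num_buffers
        self._next = 0

    def acquire(self):
        """Return (slot, buffer) for the next slot in round-robin order."""
        slot = self._next
        if self._in_flight[slot]:
            # The oldest buffer has not finished its transfer yet.
            raise RuntimeError("next buffer still has a transfer in flight")
        self._in_flight[slot] = True
        self._next = (slot + 1) % len(self._buffers)
        return slot, self._buffers[slot]

    def release(self, slot):
        """Mark a slot reusable once its host-to-device transfer completes."""
        self._in_flight[slot] = False
```

Sizing the ring to the number of pipeline stages means that, in steady state, the slot at the head of the ring has normally been released by the time it is needed again.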


Effectiveness. Figure 6 shows the effectiveness of our 
approach, where we compare the time for allocating 
pageable and pinned memory regions. Since we incur 
the additional cost of copying the data from pageable 
memory to the pinned memory region, we add this cost 
to the total cost of using pageable buffers. Overall, our 
approach is faster by an order of magnitude, which high- 
lights the importance of this optimization. 


4.2 Host Bottleneck 


The previously stated optimizations alleviate the device 
memory bottleneck for DMA transfers, and allow the 
device to focus on performing the actual computation. 
However, the host side modules can still become a bot- 
tleneck due to the serialized execution of the following 
stages (Reader→Transfer→Kernel→Store). In this
case, the fact that all four modules are serially executed 
leads to an underutilization of resources at the host side. 
To quantify this underutilization at the host, we mea- 
sured the number of idle spare cycles per core after the 
launch of the asynchronous execution of the kernel. Table 2
shows the number of RDTSC tick cycles for different
buffer sizes. The RDTSC [8] (Read Time-Stamp
Counter) instruction reads a counter that increments on every
processor cycle, which makes it suitable for monitoring
performance. The device execution time captures the
asynchronous copy and execution of the kernel, and the 
host kernel launch time measures the time for the host to 
launch the asynchronous copy and the chunking kernel. 


Highlights. These measurements highlight the following:
(i) the kernel launch time is negligible compared to
the total execution time for the kernel; (ii) the host is idle
during the device execution time; and (iii) the host has a
large number of spare cycles per core, even with a small
sized buffer.


Figure 7: Ring buffer for the pinned memory region.


Table 2: Host spare cycles per core due to asynchronous
data-transfer and kernel launch. Columns: buffer size
(16M, 32M, 64M, 128M, 256M bytes). Rows: device execution
time (ms), host kernel launch time (ms), total execution
time (ms), host RDTSC ticks @ 2.67 GHz.


Implications. Given the prevalence of host systems run- 
ning on multicore architectures, the sequential execution 
of the various components leads to the underutilization of 
the host resources, and therefore these resources should 
be used to perform other operations. 


Optimization. To utilize these spare cycles at the host, 
Shredder makes use of a multi-stage streaming pipeline 
as shown in Figure 8. The goal of this design is that 
once the Reader thread finishes writing the data in the 
host main memory, it immediately proceeds to handling 
a new window of data in the stream. Similarly, the other
threads follow this pipelined execution without waiting 
for the next stage to finish. 

To handle the specific characteristics of our pipeline 
stages, we use different design strategies for different 
modules. Since the Reader and Store modules deal 
with I/O, they are implemented as Asynchronous I/O 
(as described in §5.2.1), whereas the transfer and kernel 
threads are implemented using multi-buffering (a generalization
of the double buffering scheme described in
§4.1.1).


Effectiveness. Figure 9 shows the average speedup from 
using our streaming pipeline, measured as the ratio of 
time taken by a sequential execution to the time taken 
by our multi-stage pipeline. We varied the number of 
pipeline stages that can be executed simultaneously (by
restricting the number of buffers that are admitted to
the pipeline) from 2 to 4. The results show that a full
pipeline with all four stages being executed simultaneously
achieves a speedup of 2; the reason why this is
below the theoretical maximum of a 4X gain is that the
various stages do not have equal cost.


Figure 8: Multi-staged streaming pipeline.

Figure 9: Speedup for streaming pipelined execution.
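The effect of unequal stage costs can be made concrete with the usual steady-state pipeline bound: per buffer, a sequential run pays the sum of all stage times, whereas a full pipeline is limited by its slowest stage. The sketch below assumes fixed per-stage costs and ignores pipeline fill and drain; the function name and the stage times in the usage note are hypothetical, not measured values.

```python
def pipeline_speedup(stage_times):
    """Steady-state speedup of a full pipeline over sequential execution.

    Sequential cost per buffer is the sum of the stage times; pipelined
    cost per buffer approaches the time of the slowest stage.
    """
    if not stage_times or max(stage_times) <= 0:
        raise ValueError("need at least one stage with a positive cost")
    return sum(stage_times) / max(stage_times)
```

Four equal stages give the theoretical 4X gain, while hypothetical stage times of 1, 1, 4, and 2 units give a speedup of only 2, because the 4-unit stage sets the pace.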


4.3 Device Memory Conflicts 


We have observed (in Figure 5) that the chunking ker- 
nel dominates the overall time spent by the GPU. In this 
context, it is crucial to try to minimize the contribution 
of the device memory access latency to the overall cost. 


Highlights. The very high access latencies of the device 
memory (on the order of 400-600 cycles @ 1.15 GHz) 
and the lack of support for data caching and prefetching 
can imply a significant overhead in the overall execution 
time of the chunking kernel. 


Implications. The hierarchical memory of GPUs pro- 
vides us an opportunity to hide the latencies of the global 
device memory by instead making careful use of the 
low latency shared memory. (Recall from § 2.2 that the 
shared memory is a fast and low latency on-chip mem- 
ory which is shared among a subset of the GPU’s scalar 
processors.) However, fetching data from global to the 
shared memory requires us to be careful to avoid bank 
conflicts, which can negatively impact the performance 
of the GPU memory subsystem. This implies that we 
should try to improve the inter-thread coordination in 
fetching data from the device global memory to avoid 
these bank conflicts. 

Figure 10: Memory coalescing to fetch data from global
device memory to the shared memory. 


Optimization. We designed a thread cooperation mech- 
anism to optimize the process of fetching data from the 
global memory to the shared memory, as shown in Figure 10.
In this scheme, a single block that is needed by a
given thread is fetched at a time, but all the threads
cooperate to fetch each block and coordinate their accesses
to avoid bank conflicts. The idea is to iterate over
all data blocks for all threads in a thread block, fetch one 
data block at a time in a way that different threads request 
consecutive but non-conflicting parts of the data block, 
and then, after all data blocks are fetched, let each thread 
work on its respective blocks independently. This is fea- 
sible since threads in a warp (or half-warp) execute the 
same stream of instructions (SIMT). Figure 10 depicts 
how threads in a half-warp cooperate with each other to 
fetch different blocks sequentially in time. 


In order to ensure that the requests made by different 
threads when fetching different parts of the same data 
block do not conflict, we followed the best practices sug- 
gested by the device manufacturer to ensure these re- 
quests correspond to a single access to one row in a 
bank [6, 7, 42]. In particular, Shredder lets multiple
threads of a half-warp read a contiguous memory interval
simultaneously, under the following conditions: (i) the
size of the memory element accessed by each thread is
either 4, 8, or 16 bytes; (ii) the elements form a contiguous
block of memory, i.e., the Nth element is accessed by
the Nth thread in the half-warp; and (iii) the address of
the first element is aligned at a boundary of a multiple of
16 bytes.
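The three conditions above can be written as a small predicate over the addresses requested by the threads of a half-warp. This is a sketch that checks exactly the conditions as stated in the text; it is not a model of any particular GPU's memory controller, and the function name `is_coalesced` is hypothetical.

```python
def is_coalesced(addresses, elem_size):
    """Check the stated conditions for one half-warp.

    addresses[k] is the byte address requested by the k-th thread.
    """
    # (i) each thread accesses an element of 4, 8, or 16 bytes
    if elem_size not in (4, 8, 16):
        return False
    if not addresses:
        return False
    base = addresses[0]
    # (iii) the first element is aligned to a multiple of 16 bytes
    if base % 16 != 0:
        return False
    # (ii) the k-th thread accesses the k-th element of a contiguous block
    return all(addr == base + k * elem_size
               for k, addr in enumerate(addresses))
```

For instance, sixteen threads reading consecutive 4-byte words starting at address 64 satisfy all three conditions, whereas the same pattern starting at address 68 fails the alignment test.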


Figure 11: Normalized chunking kernel time with varied
buffer-sizes for 1 GB data. (Y-axis: time in ms; x-axis:
buffer size in bytes, 16M to 256M; series: Device Memory,
Memory Coalescing.)


Effectiveness. Figure 11 shows the effectiveness of the 
memory coalescing optimization, where we compare the 
execution time for the chunking kernel using the normal 
device memory access and the optimized version. The 
results show that we improve performance by a factor of 
8 by reducing bank conflicts. Since the granularity of 
memory coalescing is 48 KB (which is the size for the 
shared memory per thread block), we do not see any im- 
pact from varying buffer sizes (16 MB to 512 MB), and 
the benefits are consistent across different buffer sizes. 


5 Implementation and Evaluation 


We implemented Shredder in CUDA [6], and for an ex- 
perimental comparison, we also implemented an opti- 
mized parallel host-only version of content-based chunk- 
ing. This section describes these implementations and 
evaluates them. 


5.1 Host-Only Chunking using pthreads 


We implemented a library for parallel content-based 
chunking on SMPs using POSIX pthreads. We derived 
parallelism by creating pthreads that operate in differ- 
ent data regions using a Single Program Multiple Data 
(SPMD) strategy and communicate using a shared mem- 
ory data structure. At a high level, the implementation 
works as follows: (1) divide the input data equally in 
fixed-size regions among N threads; (2) invoke the Ra- 
bin fingerprint-based chunking algorithm in parallel on 
N different regions; (3) synchronize neighboring threads 
in the end to merge the resulting chunk boundaries. 
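The three steps above can be sketched with a Python thread pool. This is a structural illustration only: CRC32 of each window stands in for the Rabin fingerprint, the marker value is hypothetical, and each worker is assumed to read the `WINDOW - 1` bytes preceding its region so that the merged result equals a sequential scan (the paper does not say how region borders are handled). Python threads will not deliver the parallel speedup of pthreads because of the interpreter lock.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

WINDOW = 48   # window size from the text
MARKER = 0    # hypothetical marker value

def is_boundary(data, end, mask_bits):
    """True if the window ending at `end` matches the marker.
    CRC32 is a stand-in for the Rabin fingerprint."""
    mask = (1 << mask_bits) - 1
    return (zlib.crc32(data[end - WINDOW:end]) & mask) == MARKER

def scan_region(data, lo, hi, mask_bits):
    """Boundaries whose end offset lies in (lo, hi]. Windows may start
    before lo, so no boundary near a region border is missed."""
    first = max(lo + 1, WINDOW)
    return [e for e in range(first, hi + 1) if is_boundary(data, e, mask_bits)]

def parallel_boundaries(data, num_threads, mask_bits=13):
    n = len(data)
    if n == 0:
        return []
    step = -(-n // num_threads)  # ceiling division: equal fixed-size regions
    regions = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = pool.map(lambda r: scan_region(data, r[0], r[1], mask_bits),
                         regions)
        # Regions are disjoint and ordered, so concatenation is the merge.
        return [e for part in parts for e in part]
```

Because every end offset belongs to exactly one region, the merge step reduces to concatenating the per-thread lists in region order.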

An issue that arises is that dynamic memory allocation 
can become a bottleneck due to the serialization required
to avoid race conditions. To address this, we used
the Hoard memory allocator [12] instead of malloc. 


5.2 Shredder Implementation 


The Shredder library implementation comprises two 
main modules, the host driver and the GPU kernel. The 
host driver runs the control part of the system as a
multi-threaded process on the host CPU running Linux. The
GPU kernel uses one or more GPUs as co-processors for
accelerating the SIMT code, and is implemented using
the CUDA programming model from the NVidia GPGPU
toolkit [6]. Next we explain key implementation
details for both modules. 


5.2.1 Host Driver 


The host driver module is responsible for reading the in- 
put data either from the network or the disk and trans- 
ferring the data to the GPU memory. Once the data is 
transferred then the host process dispatches the GPU ker- 
nel code in the form of RPCs supported by the CUDA 
toolkit. The host driver has two types of function- 
ality: (1) the Reader/Store threads deal with reading 
and writing data from and to I/O channels; and (2) 
the Transfer thread is responsible for moving data be- 
tween the host and the GPU memory. We implemented 
the Reader/Store threads using Asynchronous I/O and 
the Transfer thread using CUDA RPCs and page-pinned 
memory. 


Asynchronous I/O (AIO). With asynchronous non- 
blocking I/O, it is possible to overlap processing and I/O 
by initiating multiple transfers at the same time. In AIO, 
the read request returns immediately, indicating that the 
read was successfully initiated. The application can then 
perform other processing while the background read op- 
eration completes. When the read response arrives, a sig- 
nal registered with the read request is triggered to signal 
the completion of the I/O transaction. 

Since the Reader/Store threads operate at the granularity
of buffers, a single input file I/O may lead to issuing
multiple aio_read system calls. To minimize the overhead
of multiple context switches per buffer, we used
lio_listio to initiate multiple transfers at the same
time in the context of a single system call (meaning one
kernel context switch).
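Python's standard library does not expose `aio_read` or `lio_listio`, so the sketch below is only an analogy for the idea of having several buffer-sized reads outstanding at once: a thread pool issues positional reads with `os.pread` (available on Unix-like systems) and collects the results in order. The function name `read_in_buffers` and the `max_in_flight` parameter are hypothetical, and this approach does not reproduce the single-system-call batching described above.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_in_buffers(path, buffer_size, max_in_flight=4):
    """Read a file as a list of buffer-sized pieces, keeping up to
    max_in_flight reads outstanding at the same time."""
    size = os.path.getsize(path)
    offsets = range(0, size, buffer_size)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
            # os.pread takes an explicit offset, so the reads share one
            # descriptor without racing on a file position.
            return list(pool.map(
                lambda off: os.pread(fd, buffer_size, off), offsets))
    finally:
        os.close(fd)
```

While the reads are outstanding, the calling thread could perform other processing, which is the overlap of processing and I/O that asynchronous I/O provides.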


5.2.2 GPU Kernel 


The GPU kernel can be trivially derived from the C 
equivalent code by implementing a collection of func- 
tions in equivalent CUDA C with some assembly anno- 
tations, plus different access mechanisms for data layout 
in the GPU memory. However, an efficient implementa- 
tion of the GPU kernel requires a bit more understanding 
of vector computations and the GPU architecture. We 
briefly describe some of these considerations. 


Kernel optimizations. We have implemented minor ker- 
nel optimizations to exploit vector computation in GPUs. 
In particular, we used loop unrolling and instruction- 
level optimizations for the core Rabin fingerprint block. 
These changes are important because of the simplified
GPU architecture, which lacks out-of-order execution,
pipeline stalling in register usage, or instruction
reordering to eliminate Read-after-Write (RAW) dependencies.


Figure 12: Throughput comparison of content-based
chunking between CPU and GPU versions. (Y-axis: throughput
in GBps; series include CPU w/o Hoard, GPU Streams, and
GPU Streams + Memory.)


Warp divergence. Since the GPU architecture is Single 
Instruction Multiple Threads (SIMT), if threads in a warp 
diverge on a data-dependent conditional branch, then the 
warp is serially executed until all threads in it converge 
to the same execution path. To avoid a performance dip 
due to this divergence in warp execution, we carefully 
restructured the algorithm to have little code divergence 
within a warp, by minimizing the code path under data- 
dependent conditional branches. 


5.3 Evaluation of Shredder


We now present our experimental evaluation of the per- 
formance of Shredder. 


Experimental setup. We used a Fermi-based GPU architecture,
namely the Tesla C2050 GPU consisting of
448 processor cores (SPs). It is organized as a set of 14 
SMs each consisting of 32 SPs running at 1.15 GHz. It 
has 2.6 GB of off-chip global GPU memory providing 
a peak memory bandwidth of 144 GB/s. Each SM has 
32768 registers and 48 KB of local on-chip shared mem- 
ory, shared between its scalar cores. 

We also used an Intel Xeon processor based system 
as the host CPU machine. The host system consists of 
12 Intel(R) Xeon(R) CPU X5650 @ 2.67 GHz with 48 
GB of main memory. The host machine is running Linux 
with kernel 2.6.38 in 64-bit mode, additionally patched 
with GPU direct technology [4] (for SAN devices). The 
GCC 4.3.2 compiler (with -O3) was used to compile the 
source code of the host library. The GPU code is compiled
using the CUDA toolkit 4.0 with NVidia driver version
270.41.03. The pthreads implementation is run with 12
threads.


Results. We measure the effectiveness of GPU-accelerated
content-based chunking by comparing the
performance of different versions of the host-only and
GPU based implementation, as shown in Figure 12. We
compare the chunking throughput for the pthreads implementation
with and without using the Hoard memory allocator.
For the GPU implementation, we compared the
performance of the system with different optimizations
turned on, to gauge their effectiveness. In particular,
GPU Basic represents a basic implementation without
any optimizations. The GPU Streams version includes
the optimization to remove host and device bottlenecks
using double buffering and a 4-stage pipeline. Lastly,
GPU Streams + Memory represents a version with all
optimizations, including memory coalescing.


Figure 13: Incremental computations using Shredder.

Our results show that a naive GPU implementation can 
lead to a 2X improvement over a host-only optimized im- 
plementation. The observation clearly highlights the po- 
tential of GPUs to alleviate computational bottlenecks. 
However, this implementation does not remove chunk- 
ing as a bottleneck since SAN bandwidths on typical data 
servers exceed 10 Gbps. Incorporating the optimizations
leads to Shredder outperforming the host-only implementation
by a factor of over 5X.


6 Case Study I: Incremental Computations 


This section presents a case study of applying Shredder 
in the context of incremental computations. First we re- 
view Incoop, a system for bulk incremental processing, 
and then describe how we used Shredder to improve it. 


6.1 Background: Incremental MapReduce 


Incoop [14] is a generic MapReduce framework for in- 
cremental computations. Incoop leverages the fact that 
data sets that are processed by bulk data processing 
frameworks like MapReduce evolve slowly, and often the 
same computation needs to be performed repeatedly on 
this changing data (such as computing PageRank on ev- 
ery new web crawl) [21, 32, 34]. Incoop aims at pro- 
cessing this data incrementally, by avoiding recomputing 
parts of the computation that did not change, and trans- 
parently, by being backwards-compatible with the inter- 
face used by MapReduce frameworks. 

To achieve these goals, Incoop employs a fine-grained
result reuse mechanism, which captures a dependence
graph among inputs and sub-computations, propagates
changes along that graph so that only sub-computations
that have changed need to be recomputed, and uses
memoization to be able to reuse outputs from sub-computations
whose inputs did not change. Incoop uses
the Inc-HDFS file system (an extension to HDFS) to
identify changes in the input and propagate them.


Figure 14: Shredder enabled chunking in Inc-HDFS.


6.2 GPU-Accelerated Incremental HDFS 


We use Shredder to support Incoop by designing a GPU- 
accelerated Incremental HDFS (Inc-HDFS), which is in- 
tegrated with Incoop as shown in Figure 13. Inc-HDFS 
leverages Shredder to perform content-based chunking 
instead of using fixed-size chunking as in the original 
HDFS, thus ensuring that small changes to the input lead
to small changes in the set of chunks that are provided as 
input to Map tasks. This enables the results of the com- 
putations performed by most Map tasks to be reused. 
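The stability argument can be demonstrated on synthetic data. The sketch below compares how many chunks survive a one-byte insertion at the start of the input under fixed-size and content-based chunking. It is a toy under stated assumptions: CRC32 over a 16-byte window with a 6-bit mask stands in for Shredder's Rabin-based chunking, the data is random, and all names are hypothetical.

```python
import random
import zlib

def fixed_chunks(data, size):
    """Split at fixed offsets, as in the original HDFS."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data, window=16, mask_bits=6):
    """Split wherever the hash of the trailing window matches a marker (0).
    CRC32 is a stand-in for the Rabin fingerprint."""
    mask = (1 << mask_bits) - 1
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if (zlib.crc32(data[end - window:end]) & mask) == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a marker
    return chunks

def reused(old_chunks, new_chunks):
    """Number of new chunks whose content already existed."""
    old = set(old_chunks)
    return sum(1 for c in new_chunks if c in old)

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4000))
edited = b"X" + original  # a small change at the very beginning

fixed_reused = reused(fixed_chunks(original, 64), fixed_chunks(edited, 64))
cdc_reused = reused(content_chunks(original), content_chunks(edited))
print("fixed-size chunks reused:", fixed_reused)
print("content-based chunks reused:", cdc_reused)
```

With fixed-size chunking the insertion shifts every later chunk, so essentially nothing matches; with content-based chunking the boundaries move with the content, so only the chunk containing the edit changes.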


6.3 Implementation and Evaluation


We built our prototype GPU-accelerated Inc-HDFS on 
Hadoop-0.20.2. It is implemented as an extension to 
HDFS, where the computationally expensive chunking
is offloaded to the Shredder-enabled HDFS client (as 
shown in Figure 14), before uploading chunks to the re- 
spective data nodes that will be storing them. 


Inc-HDFS client. We integrated the Shredder library
with the Inc-HDFS client using a Java-CUDA interface.
Once the data upload function is invoked, the Shredder library
notifies the Store thread of the chunk boundaries,
which in turn pushes the chunks from the memory of the 
client to the data nodes of HDFS. 


Semantic chunking framework. The default behav- 
ior of the Shredder library is to split the input file into 
variable-length chunks based on the contents. However, 
since chunking is oblivious to the semantics of the input 
data, this could cause chunk boundaries to be placed any- 
where, including, for instance, in the middle of a record 
that should not be broken. To address this, we leverage
the fact that the MapReduce framework relies on the
InputFormat class of the job to split up the input file(s)
into logical InputSplits, each of which is then assigned to
an individual Map task. We reuse this class to ensure that
we respect the record boundaries in the chunking process.


Figure 15: Speedup for incremental computation


HDFS shell. We extended the HDFS shell interface to 
invoke content-based chunking using the Shredder implementation.
In particular, the shell interface offers a new
command (in addition to copyFromLocal) for uploading
data in Inc-HDFS: copyFromLocalGPU.


Evaluation. We evaluated the effectiveness of incre- 
mental computations by measuring the speedups w.r.t. 
Hadoop for varying percentages of changes in the in- 
put data. Figure 15 shows the performance gains on 
a 20-node cluster, where all three MapReduce appli- 
cations (K-means, Word-Count, Co-occurrence Matrix) 
show significant improvement in run-time for incremen- 
tal runs. The effectiveness of the incremental approach 
degrades as the percentage of changes in the data set in- 
creases. Note that this experiment is not meant to high- 
light the speedup enabled by GPU acceleration, but in- 
stead shows how, after the data is chunked using Shred- 
der, detecting duplicates at the storage level can imply 
savings in computation time. 


7 Case Study II: Incremental Storage 


In this section, we present our second case study where 
we use Shredder in the context of a consolidated incre- 
mental backup system. 


7.1 Background: Cloud Backup 


Figure 16 describes our target architecture, which is typ- 
ical of cloud back-ends. Applications are deployed on 
virtual machines hosted on physical servers. The file 
system images of the virtual machines are hosted in a 
virtual machine image repository stored in a SAN vol- 
ume. In this scenario, the backup process works in the 
following manner. Periodically, full image snapshots are 
taken for all the VM images that need to be backed up. 

Figure 16: A typical cloud backup architecture


The core of the backup process is a backup server and 
a backup agent running inside the backup server. The 
image snapshots are mounted by the backup agent. The 
backup server performs the actual backup of the image 
snapshots onto disks or tapes. The consolidated or cen- 
tralized data backup process ensures compliance of all 
virtual machines with the agreed upon backup policy. 
Backup servers typically have very high I/O bandwidth 
since, in enterprise environments, all operations are typ- 
ically performed on a SAN [28]. Furthermore, the use 
of physical servers allows multiple dedicated ports to be 
employed solely for the backup process. 


7.2 GPU-Accelerated Data Deduplication 


The centralized backup process is eminently suitable 
for deduplication via content-based chunking, as most 
images in a data-center environment are standardized. 
Hence, virtual machines share a large number of files 
and a typical backup process would unnecessarily copy 
the same content multiple times. To exploit this fact, we 
integrate Shredder with the backup server, thus enabling 
data to be pushed to the backup site at a high rate while 
simultaneously exploiting opportunities for savings. 

The Reader thread on the backup server reads the 
incoming data and pushes that into Shredder to form 
chunks. Once the chunks are formed, the Store thread 
computes a hash for the overall chunk, and pushes the 
chunks in the backup setup as a separate pipeline stage. 
Thereafter, these hashes collected for the chunks are 
batched together to enqueue in an index lookup queue. 
Finally, a lookup thread picks up the enqueued chunk 
fingerprints and looks up in the index whether a partic- 
ular chunk needs to be backed up or is already present 
in the backup site. If a chunk already exists, a pointer 
to the original chunk is transferred instead of the chunk 
data. We deploy an additional Shredder agent residing on 
the backup site, which receives all the chunks and point- 
ers and recreates the original uncompressed data. The 
overall architecture for integrating Shredder in a cloud 
backup system is described in Figure 17.
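
The flow from chunking to index lookup can be sketched as follows. This is a minimal, single-threaded illustration and not Shredder's implementation: the per-window polynomial hash is a simple stand-in for GPU-based Rabin fingerprinting, SHA-1 is an assumed choice of chunk hash, and the names `window_hash`, `split_chunks`, `backup`, `restore`, `WINDOW`, and `MASK` are hypothetical.

```python
import hashlib

WINDOW = 16          # sliding-window size in bytes (assumed value)
MASK = (1 << 6) - 1  # declare a boundary when the low 6 hash bits are zero (assumed)

def window_hash(window):
    """Toy polynomial hash over one window; a stand-in for a Rabin fingerprint."""
    h = 0
    for b in window:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

def split_chunks(data):
    """Content-defined chunking: boundaries depend only on the window contents."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if window_hash(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks

def backup(data, index, store):
    """Return the stream sent to the backup site: pointers for known chunks, data otherwise."""
    stream = []
    for chunk in split_chunks(data):
        fp = hashlib.sha1(chunk).hexdigest()  # chunk fingerprint
        if fp in index:                       # index lookup: chunk already backed up
            stream.append(("ptr", fp))
        else:
            index[fp] = len(store)
            store.append(chunk)
            stream.append(("data", chunk))
    return stream

def restore(stream, index, store):
    """Backup-site agent: rebuild the original data from chunks and pointers."""
    return b"".join(store[index[p]] if kind == "ptr" else p for kind, p in stream)
```

Backing up the same image a second time with the same `index` and `store` yields a stream made only of pointers, which is the saving the deduplicating backup aims for.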


7.3 Implementation and Evaluation


Since high bandwidth fibre channel adapters are fairly 
expensive, we could not recreate the high I/O rate of 
modern backup servers in our testbed. Hence, we used 
a memory-driven emulation environment to experimen- 
tally validate the performance of Shredder. On our 
backup agent, we keep a master image in memory us- 
ing memcached [5]. The backup agent creates new file 
system images from the master image by replacing part 
of the content from the master image using a predefined 
similarity table. The master image is divided into segments.
The image similarity table contains the probability
of each segment being replaced by different content.
The agent uses these probabilities to decide which seg- 
ments in the master image will be replaced. The image 
generation rate is kept at 10 Gbps to closely simulate the 
I/O processing rate of modern X-series employed for I/O 
processing applications [28]. 
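
A minimal sketch of how such an image generator might work is shown below. The function name `generate_image`, the list-of-bytes representation of segments, and the use of random bytes as replacement content are assumptions made for illustration; the source does not specify them.

```python
import random

def generate_image(master_segments, replace_prob, rng):
    """Derive a new image from the master image.

    master_segments: list of bytes objects (the segmented master image).
    replace_prob:    replace_prob[i] is the probability that segment i is replaced.
    rng:             a random.Random instance, so runs can be made repeatable.
    """
    image = []
    for segment, p in zip(master_segments, replace_prob):
        if rng.random() < p:
            # Replace the segment with different content of the same length.
            image.append(bytes(rng.getrandbits(8) for _ in range(len(segment))))
        else:
            image.append(segment)
    return image
```

Setting every probability to 0 reproduces the master image exactly, while higher probabilities lower the similarity between generated images.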

In this experiment, we also enable the requirement of 
a minimum and maximum chunk size, as used in practice 
by many commercial backup systems. As mentioned in 
Section 3, our current implementation of Shredder is not 
optimized for including a minimum and maximum chunk 
size, since the data that is skipped after a chunk bound- 
ary is still scanned for computing a Rabin fingerprint on 
the GPU, and only after all the chunk boundaries are col- 
lected will the Store thread discard all chunk boundaries 
within the minimum chunk size limit. As future work, 
we intend to address this limitation using more efficient 
techniques that were proposed in the literature [29, 31]. 
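
The post-processing step described above, in which boundaries are first collected over the whole buffer and then filtered, could look roughly like the sketch below. The function name `apply_size_limits` and the policy of forcing a cut every `max_size` bytes are assumptions; the source only states that boundaries inside the minimum chunk size are discarded.

```python
def apply_size_limits(boundaries, total_len, min_size, max_size):
    """Filter raw chunk boundaries (sorted end offsets) after scanning.

    Boundaries closer than min_size to the previous kept boundary are dropped;
    a cut is forced every max_size bytes when no boundary is found (assumed policy).
    The end of the data is always kept as the final boundary.
    """
    kept, start = [], 0
    for b in list(boundaries) + [total_len]:
        # Force cuts so that no chunk exceeds max_size.
        while b - start > max_size:
            start += max_size
            kept.append(start)
        # Keep the boundary only if the chunk is long enough, or it ends the data.
        if (b - start >= min_size or b == total_len) and b > start:
            kept.append(b)
            start = b
    return kept
```

For example, with raw boundaries `[3, 10, 12, 40]` over 45 bytes, a minimum of 5 and a maximum of 16, the boundaries at 3 and 12 are discarded and a cut is forced at 26.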

As a result of this limitation, we observe in Fig- 
ure 18 that we are able to achieve a speedup of only 
2.5X in backup bandwidth compared to the pthread im- 
plementation, but still we manage to keep the backup 
bandwidth close to the target 10 Gbps. The results 
also show that even though the chunking process oper- 
ates independently of the degree of similarity in input 
data, the backup bandwidth decreases when the similar- 
ity between the data decreases. This is not a limitation 
Figure 18: Backup bandwidth improvement due to
Shredder with varying image similarity ratios 


of our chunking scheme but of the unoptimized index 
lookup and network access, which reduces the backup 
bandwidth. Combined with optimized index mainte- 
nance (e.g., [17]), Shredder is likely to achieve the tar- 
get backup bandwidth for the entire spectrum of content 
similarity. 


8 Related Work


Our work builds on contributions from several different 
areas, which we briefly survey. 


GPU-accelerated systems. GPUs were initially de- 
signed for graphics rendering, but, because of their cost- 
effectiveness, they were quickly adopted by the HPC 
community for scientific computations [3, 35]. Re- 
cently, the systems research community has leveraged 
GPUs for building other systems. In particular, Pack- 
etShader [23] is a software router for general packet pro- 
cessing, and SSLShader [26] uses GPUs in web servers 
to efficiently perform cryptographic operations. GPUs 
have also been used to accelerate functions such as pat- 
tern matching [44], network coding [43], and complex 
cryptographic operations [24]. In our work, we explored 
the potential of GPUs for large scale data, which raises 
challenges due to the overheads of data transfer. Re- 
cently, GPUs were used in software-based RAID con- 
trollers [16] for performing high-performance calcula- 
tions of error correcting codes. However, this work does 
not propose optimizations for efficient data transfer. 


Incremental Computations. Since modifying the out- 
put of a computation incrementally is asymptotically 
more efficient than recomputing everything from scratch, 
researchers and practitioners have built a wide range 
of systems and algorithms for incremental computa- 
tions [14, 21, 25, 32, 34, 36, 37]. Our proposal speeds 
up the process of change identification in the input and is 
complementary to these systems. 

Incremental Storage. Data deduplication is commonly 
used in storage systems. In particular, there is a large 
body of research on efficient index management [13, 17,
30, 46, 47]. In this paper, we focus on the complementary
problem of content-based chunking [20, 27, 33].
High throughput content-based chunking is particularly 
relevant in environments that use SANs, where chunk- 
ing can become a bottleneck. To overcome this bottle- 
neck, systems have compromised the deduplication effi- 
ciency with sampling techniques or fixed-size chunking, 
or they have tried to scale chunking by deploying multi- 
node systems [15, 18, 19, 45]. A recent proposal shows 
that multi-node systems not only incur a high cost but 
also increase the reference management burden [22]. As 
a result, building high-throughput, cost-effective,
single-node systems becomes more important. Our system can
be seen as an important step in this direction. 


Network Redundancy Elimination. Content-based 
chunking has also been proposed in the context of 
redundancy elimination for content distribution net- 
works (CDNs), to reduce the bandwidth consumption of 
ISPs [9, 10, 11, 38]. Also, many commercial vendors 
(such as Riverbed, Juniper, Cisco) offer middleboxes to 
improve bandwidth usage in multi-site enterprises, data 
centers and ISP links. Our proposal is complementary to 
this work, since it can be used to improve the throughput 
of redundancy elimination in such solutions. 


9 Conclusions and Future Work 


In this paper we have presented Shredder, a novel frame- 
work for content-based chunking using GPU accelera- 
tion. We applied Shredder to two incremental storage 
and computation applications, and our experimental re- 
sults show the effectiveness of the novel optimizations 
that are included in the design of Shredder. 


There are several interesting avenues for future work. 
First, we would like to incorporate into the library 
several optimizations for parallel content-based chunk- 
ing [29, 31]. Second, our proposed techniques need to 
continuously adapt to changes in the technologies that 
are used by GPUs, such as the use of high-speed Infini- 
Band networking, which enables further optimizations in 
the packet I/O engine using GPU-direct [4]. Third, we 
would like to explore new applications like middleboxes for
bandwidth reduction using network redundancy elimina- 
tion [10]. Finally, we would like to incorporate Shredder 
as an extension to recent proposals to devise new operat- 
ing system abstractions to manage GPUs [41]. 
Abstract


Historically, storage controllers have been extended by 
integrating new code, e.g., file serving, database process- 
ing, deduplication, etc., into an existing base. This in- 
tegration leads to complexity, co-dependency and insta- 
bility of both the original and new functions. Hypervi- 
sors are a known mechanism to isolate different func- 
tions. However, to enable extending a storage controller 
by providing new functions in a virtual machine (VM), 
the virtualization overhead must be negligible, which is 
not the case in a straightforward implementation. This 
paper demonstrates a set of mechanisms and techniques 
that achieve near zero runtime performance overhead for 
using virtualization in the context of a storage system. 


1 Introduction 


Additional functions, such as file serving or database, are 
often added to existing storage systems to meet new re- 
quirements. Historically, this has been done via code in- 
tegration, or by running the new function on a gateway 
or virtual storage appliance (VSA [37]). Code integration 
generally performs best. However, the new function must 
run on the same OS version, the controller’s main func- 
tionality is vulnerable to bugs due to lack of isolation, 
resource management is complicated for software which 
assumes a dedicated system, and development complex- 
ity increases in particular when the new function already 
exists as independent software. The gateway approach 
offers isolation but adds both latency and hardware costs. 

A hypervisor can isolate the new function, allow for 
differing OS versions, and simplify development. How- 
ever, until now the high performance overhead of virtu- 
alization (in particular virtualized I/O) has made this ap- 
proach impractical. In this paper, we show how to use 
server-based virtualization to integrate new functions into 
a storage system with near zero performance cost. Our 
approach is in line with the VSA approach, but we run 
the VM directly on the storage system. 

While our work was done using KVM [14], our in- 
sights are not KVM-specific. We do take advantage of the 
fact that KVM uses an asymmetric model in which some 
of the code is virtualized (the new features) while other 
code (the original storage system) runs on “bare metal,” 
unaware of the existence of the hypervisor. 

There are three sources of performance overhead. Base 
overheads include aspects such as virtual memory management
or process switching. External communication
with storage clients is important when the new function is 
a “filter” on top of the original storage system, e.g., a file 
server. Finally, internal communication overheads are in- 
curred to tie the new function to the original controller. 

To reduce base overhead, we use two main techniques. 
First, we statically allocate CPU cores to the guest to en- 
sure that the function has sufficient resources. Second, 
we statically allocate memory for the VM, backing that
area with larger pages to reduce translation overheads. 

The straightforward implementation of external com- 
munication is expensive because the hypervisor inter- 
venes when physical events occur (e.g., interrupts or de- 
vice accesses). Each such intervention entails an ex- 
pensive “exit” from the guest code to the hypervisor. 
The highest-performing approach for reducing this over- 
head is device assignment, which eliminates exits for de- 
vice access. Thus, to reduce these costs, we assign the 
network device directly to the guest using an SR-IOV- 
enabled adapter [23] which allows the guest to send re- 
quests directly to the device. To eliminate exits for inter- 
rupts, we use polling instead of interrupts, a well-known 
technique in storage systems. 

To reduce the cost of internal communication, we mod- 
ified KVM’s para-virtual block driver to poll as well, 
eliminating exits due to PIOs and interrupt injections. 
This provides for a fast, exit-less, zero-copy transport. 

By using these techniques, we show no measurable dif- 
ference in network latency between bare metal and virtu- 
alized I/O and under 5% difference in throughput. For 
internal communication, micro-benchmarks show 6.6µs
latency overhead, read throughput of 357K IOPS, and 
write throughput of 284K IOPS; roughly seven times bet- 
ter than a base KVM implementation. In addition, an I/O 
intensive filer workload running in KVM incurs less than 
0.4% runtime performance overhead compared to bare 
metal integration. 

Our main contributions are: 


• a detailed, benchmark-driven analysis of virtualization
overheads in a storage system context,

• a set of approaches to removing overheads, and

• a demonstration of how these approaches enable
running new storage features in a VM with essentially
zero runtime performance overhead.
The rest of the paper is organized as follows. Section 2
provides background on KVM and VM I/O. We take 
an incremental approach to show our performance im- 
provements; Section 3 describes the experimental envi- 
ronment. Sections 4 and 5 present a performance analysis 
and describe optimizations related to the external and
internal communication interfaces, respectively. Base
overheads are shown together with macro-benchmark results
in Section 6. Section 7 describes related work and we
conclude in Section 8. 


2 x86 I/O Virtualization Primer 


We now provide some background information on KVM 
(the hypervisor used in this paper) and virtual machine 
I/O. There are two main options for where a hypervisor 
resides. Type 1 hypervisors run directly on the hardware,
whereas type 2 hypervisors are hosted by an OS.
KVM takes a hybrid approach that combines the bene- 
fits of both. It is a Linux kernel module that leverages 
Intel VT-x or AMD-V CPU features for running unmod- 
ified virtual machines, thereby creating a single host ker- 
nel/hypervisor that runs both processes and virtual ma- 
chines. Such a hybrid architecture allows the storage con- 
troller software to run unmodified on bare metal while 
also running additional functionality in virtual machines. 

There are three main methods for accessing I/O de- 
vices in VMs. In the first, emulation, the hypervisor em- 
ulates a specific device in software [35]. The OS run- 
ning in the VM (guest OS) uses its regular device drivers 
to access the emulated device. This method requires no 
changes to the guest, but suffers from poor performance. 

In the second method, para-virtualization [4], the 
guest OS runs specialized code to cooperate with the hy- 
pervisor to reduce overheads. For example, KVM’s para- 
virtualized drivers use virtio [26], which presents a ring 
buffer transport (vring) and device configuration as a PCI 
device. Drivers such as network, block, and video are 
implemented using virtio. In general, the guest OS driver 
places pointers to buffers on the vring and initiates I/O 
via a Programmed I/O (PIO) command. The hypervisor 
directly accesses the buffers from the guest OS’s mem- 
ory (zero-copy). Para-virtualized devices perform better 
than emulated devices, but require installing hypervisor- 
specific drivers in the guest OS. 

The third method, device assignment [6, 17, 39], gives
the VM a physical device that it can submit I/Os to without
the hypervisor’s involvement. An I/O Memory Management
Unit (IOMMU) provides address translation and
memory protection [6, 7, 38]. Interrupts, however, are
routed to the guest OS via the hypervisor. Assigning a
device to the VM means that no other OS can access
it (including the hypervisor or other guests). However,
technologies such as Single Root I/O Virtualization (SR- 
IOV) [23] allow devices to be assigned to multiple OSs. 
3 Experimental Setup


We take an incremental approach to showing how to 
eliminate the virtualization overheads. For our ex- 
periments we used two servers, each with two quad- 
core 2.93GHz EPT-enabled Intel Xeon 5500 processors, 
16GB of RAM and an Emulex OneConnect 10Gb Eth- 
ernet adapter. The servers were connected with a 10Gb 
cable. One server acted as a load generator and the other 
was our (emulated) storage controller platform. 

We used RHEL 5.4 with the RedHat 2.6.18-164.el5 
kernel for both the load server and the guest. The con- 
troller server used the RedHat kernel for bare metal runs 
and Ubuntu 9.10 with a vanilla 2.6.33 kernel for KVM 
runs. The newer kernel was necessary for running KVM. 

The controller server was run with four cores enabled, 
unless otherwise specified. For VM-based experiments, 
two cores and 2GB of memory were assigned to the 
guest; all four cores were used by the host in the bare 
metal cases. 

We used an 8GB ramdisk for the storage back-end in 
the experiments described in Sections 5 and 6. This allowed
us to measure I/O performance without physical
disks becoming the bottleneck. We accessed the ramdisk 
via a loopback device, which allowed us to assign disk 
I/O handling to specific cores, similar to the way a stor- 
age controller functions. 

All results shown are the averages of at least 5 runs, 
with standard deviations below 5%. 


4 Network Communication Performance 


Enabling the guest to interact with the outside world requires
I/O access. As discussed in Section 2, each of the
three common approaches to I/O virtualization has benefits
and drawbacks. We identified device assignment,
the best-performing option, as the most suitable approach
for adding new functionality to storage controllers.
KVM’s initial device assignment implementation,
however, did not provide the necessary performance.
In the remainder of this section, we analyze device
assignment and discuss a set of optimizations which
allowed us to achieve near bare-metal performance.

Virtualization overhead is mainly due to events that are 
trapped by the hypervisor, causing costly exits [1, 5, 16].
The overhead is a function of the frequency of exits and the
time it takes the hypervisor to handle the exit and resume 
running the guest. To examine the performance impact 
of virtualization for our intended use and ways to reduce 
it, we focused on networking micro-benchmarks. Our 
goal is to minimize the amount of time that the hypervisor 
needs to run, by minimizing the number of exits. 
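
As a rough illustration of this relationship, the fraction of a core lost to exits can be modeled as the exit rate multiplied by the time per exit. The function name `exit_overhead_fraction` and the numbers in the usage note are hypothetical and are not measurements from this paper.

```python
def exit_overhead_fraction(exits_per_sec, seconds_per_exit):
    """Fraction of one core's time spent handling exits (simple linear model)."""
    return exits_per_sec * seconds_per_exit
```

Under this model, an assumed 140,000 exits per second at an assumed 2 microseconds per exit would consume about 28% of a core, which is why reducing the number of exits matters more than shaving the cost of each one.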

The first technique that we used to improve the guest’s
performance is related to the handling of the hlt (halt)
and mwait x86 instructions. When the OS does not have
any work to do it can call these instructions to enter a
power saving mode. Most hypervisors will trap these
commands and will run other tasks on the core. In our
case, however, the new function should always run. We
therefore instructed the guest OS to enter an idle loop
when there is no work to be done by enabling a kernel
boot parameter (idle=poll). This improves performance,
as the guest is always running.


The second technique that we used is related to inter- 
rupt handling. Most of the guest exits related to device 
assignment are caused by interrupts [2, 5, 18]. Every external
interrupt causes at least two guest exits: first, when
the interrupt arrives (causing the hypervisor to gain control
and to inject the interrupt to the guest), and second, when the
guest signals completion of the interrupt handling (causing
the host to gain control and to emulate the completion
for the guest). The guest can configure the adapter to
use two different interrupt-delivery modes: MSI, which 
is the newer message based interrupt protocol, or the 
legacy INTX protocol. The KVM implementation we 
used incurred additional overhead when using MSI inter- 
rupts, due to additional exits for masking and unmasking 
adapter interrupts. Since most of the virtualization over- 
head comes from interrupts, our approach is to run the 
adapter in polling rather than interrupt-driven mode. 


In Linux today, most network adapters use NAPI [30, 
31], a hybrid approach to reducing interrupt overhead 
which switches between polling and interrupt-driven op- 
eration depending on the network traffic. However, even 
with NAPI, we have seen interrupt rates of 70K interrupts 
per second. Since such a high interrupt rate can incur pro- 
hibitive overhead and interrupts are not necessary for our 
intended use case, we decided to forgo interrupts and use 
polling. Our polling driver creates a new thread for the 
polling functions. The adapter we use has three types of 
events: packet received, packet sent, and command com- 
pletion. Since there is no way to know when a packet 
will be received, our polling driver continuously polls for 
packets received; packet sent and command completion 
indications are handled by the same polling thread every 
so often. Using a constantly polling thread means that we
dedicate most of a core for this functionality. While this 
might seem expensive from the resources perspective, it 
proved critical to achieve the desired performance. A sin- 
gle core could also be used to poll multiple devices by 
integrating their polling threads into a single thread, or 
by scheduling different polling threads on the same core. 
We did not experiment with this configuration. 
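
The structure of such a polling thread can be sketched as below. This is a user-space toy under stated assumptions, not the authors' driver: `FakeAdapter`, `poll_loop`, and the `housekeeping_every` interval are hypothetical, and a real driver would read device rings rather than Python queues.

```python
from collections import deque

class FakeAdapter:
    """Hypothetical adapter with one queue per event type named in the text."""
    def __init__(self):
        self.rx = deque()        # packets received
        self.tx_done = deque()   # packet-sent indications
        self.cmd_done = deque()  # command completions

def poll_loop(adapter, on_rx, on_other, iterations, housekeeping_every=64):
    """Busy-poll the adapter; it never sleeps and never waits for an interrupt."""
    for i in range(iterations):
        # Receive path: checked on every pass, since arrivals are unpredictable.
        while adapter.rx:
            on_rx(adapter.rx.popleft())
        # Sent and command indications: checked only every so often.
        if i % housekeeping_every == 0:
            while adapter.tx_done:
                on_other("tx", adapter.tx_done.popleft())
            while adapter.cmd_done:
                on_other("cmd", adapter.cmd_done.popleft())
```

Polling several devices from one core, as suggested above, would amount to calling the body of this loop for each adapter in turn.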


Next we evaluate the performance of the polling driver 
using network micro-benchmarks. Table 1 depicts the av- 
erage duration time of a ping flood command going from 
a client machine to the system under test. The system un- 
der test replies to pings using our polling driver either in 
polling mode or in INTX mode. The driver runs either in 
the host (bare-metal), or in the guest with halt disabled, 


Bare-metal | Guest | Guest halt


ONIX [24] 4 | 8 


Table 1: Ping average latency (µs)
or in the guest with halt enabled (guest halt).

Figure 1 shows the results for several netperf 
request-response configurations, measuring round-trip 
time using 1-byte packets. guest msi and guest intx stand
for the guest using MSI and INTX interrupt delivery, re- 
spectively. host msi stands for the host using interrupts 
in MSI mode; guest poll and host poll stand for the guest 
and host using polling mode, respectively. As expected, 
polling mode achieves better performance than interrupt 
mode in the host (i.e., on bare metal). Since in guest mode
the cost of interrupts is much higher, the gain from using 
polling is more significant than in the bare-metal case. 
Using MSI interrupts in guest mode has significant im- 
pact on the performance with this KVM version since 
there are frequent exits due to interrupt masking calls by 
the guest. 

Figure 2 shows the results of a single-threaded 
netperf send TCP throughput test (system under test 
is sending) in the same configurations as the previous fig- 
ure: host using polling and INTX interrupts, guest using 
Figure 3: netperf TCP receive throughput


polling, INTX, and MSI interrupts. Here the contribution 
of polling is less noticeable, since the TCP stack batches 
network processing. On bare metal, polling provides bet- 
ter performance than interrupt mode. In guest mode, the 
advantage of polling is much more significant. 

Figure 3 shows the results of a single-threaded 
netperf receive TCP throughput test (system under 
test is receiving). Here it is surprising to see that for the 
bare-metal case the performance of polling (host poll) 
is less than that of interrupts (host msi). The reason is 
that in the netperf throughput test the sender is the bot- 
tleneck. When the receiver is working in polling mode, it 
sends many more acknowledgment packets to the sender. 
For example in the case of 1K messages, the receiver 
sends approximately 10 times as many ACKs. Since 
the sender is already the bottleneck, sending more ACKs 
generates more load on the sender, which reduces sender 
throughput. While this issue is noticeable for this micro- 
benchmark, in practice, the handling time of a packet 
by the receiver is much larger, hence in most cases the 
sender is not the bottleneck. Polling achieves the same 
performance in the guest poll and host poll cases, which 
indicates that the virtualization-induced runtime over- 
head is negligible. 

To verify that the reduced polling performance for the 
receive test is an artifact of TCP, we ran the same test us- 
ing UDP. With UDP, all setups—guest or bare metal, in- 
terrupts or polling—achieve the same performance. Be- 
cause the sender is the bottleneck, once the TCP ACK 
effect is removed, performance is not affected by the re- 
ceiver’s mode of operation. 


5 Internal Communication Performance 


Of the three methods for accessing I/O devices described 
in Section 2, we use para-virtualization for internal com- 
munication. Para-virtualization performs better than em- 
ulated devices, and because we supply the VM image that 
runs in the controller, we can easily use custom drivers. 
Further, our goal is to transmit I/O requests to a controller 
process running on the host, so device assignment is less 
Figure 5: Para-virtualized block I/O path with polling.


practical. For example, we cannot assign the drives be- 
cause the storage controller must “own” them, and not the 
guest OS. One may also consider using external commu- 
nication to access the controller via iSCSI or Fibre Chan- 
nel, but this adds unnecessary communication overheads. 
We use ramdisk as the backing store for our analysis 
to prevent the disks from dominating latencies or becom- 
ing a bottleneck. In addition, we use direct I/O to prevent 
caching effects that mask virtualization overheads. La- 
tencies presented are the average over 10 minutes. 
Section 5.1 describes the vanilla KVM para-virtualized 
block I/O, and Section 5.2 describes our optimizations. 


5.1 KVM Para-virtualized Block I/O 


Figure 4 depicts the unmodified para-virtualized block 
I/O path in KVM, along with associated latencies for ma- 
jor code blocks when executing a 4KB direct I/O write 
request. The guest application initiates an I/O, which is 
handled by the guest kernel as usual. The direct I/O wait- 
ing time (DIO Wait, 16.6% of the total), consists of world 
switches and context switches between threads inside the
guest. Though we have drawn it as one block, it is in- 
terleaved with other code running on the same core. The 
para-virtualized block driver (virtio-block front-end) it- 
erates over the requests in the elevator queue and places 
each request’s I/O descriptors on the vring, a queue re- 
siding in guest memory that is accessible by the host. 
The driver then issues a programmed I/O (PIO) command 
which causes a world switch from guest to host. 

Control is transferred to the KVM kernel module to 
handle the exit. The post- and pre-execution times (24% 
and 18.4%, respectively) account for the work done 
by both KVM and QEMU to change contexts between 
the guest and QEMU process (including the exit/entry). 
KVM identifies the cause of the exit and, in this case, 
passes control to the QEMU virtio-block back-end (BE). 
It extracts the I/O descriptors from the vring without 
copies and passes the requests to the block driver layer 
(QEMU BDRV), which initiates asynchronous I/Os to 
the block device. The guest may now resume execution. 

An event-driven dedicated QEMU thread receives I/O 
completions and forwards them to the virtio-block BE. 
The BE updates the vring with completion information 
and calls upon KVM to inject an interrupt into the guest, 
for which KVM must initiate a world switch. When the 
guest resumes, its kernel handles the interrupt as normal, 
and then accesses its APIC to signal the end of interrupt, 
causing yet another exit. Locks to synchronize the two 
QEMU threads incur additional overhead. 


5.2 Para-virtualized Block Optimizations 


To reduce virtualization overhead, we added a polling 
thread to QEMU as depicted in Figure 5. The thread 
polls the vring (1) for I/O requests coming from the guest 
and (2) for I/O completions coming from the host kernel.
The polling thread invokes the virtio-block BE code
on incoming I/Os and completions. This thread does not 
necessarily need to reside in QEMU; if the storage con- 
troller is polling-based, its polling thread may be used. 

As discussed in Section 4, we added a thread to the 
guest which polls the networking device. We utilize this 
same thread to poll the vring for I/O completions. When 
it detects an I/O completion event, it invokes the guest 
I/O completion stack, which would normally be called 
by the interrupt handler. By using polling on both sides 
of the vring, we avoid all I/O-related exits, and thus also 
avoid all of the pre- and post-guest execution code. We 
also avoid locking the queue, since now only the polling 
thread accesses it. For the 4KB direct I/O write, this improves
the latency from 50µs to 15.9µs.
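
The idea of polling on both sides of the vring can be illustrated with a toy model. The sketch below is not virtio or QEMU code: `Ring`, `host_poll`, and `guest_poll` are hypothetical names, and the two rings stand in for the request and completion directions of a vring. No notification (PIO or interrupt) is ever issued; each side simply checks its ring.

```python
class Ring:
    """Fixed-size single-producer/single-consumer ring (toy stand-in for a vring)."""
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0  # index of the next slot to consume
        self.tail = 0  # index of the next slot to fill

    def full(self):
        return self.tail - self.head == len(self.slots)

    def push(self, item):
        if self.full():
            return False
        self.slots[self.tail % len(self.slots)] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring empty
        item = self.slots[self.head % len(self.slots)]
        self.head += 1
        return item

def host_poll(requests, completions, backend):
    """One pass of the host-side poller: serve requests and post completions."""
    while not completions.full():
        req = requests.pop()
        if req is None:
            break
        completions.push((req, backend(req)))

def guest_poll(completions, on_complete):
    """One pass of the guest-side poller: run the completion path for finished I/Os."""
    while True:
        done = completions.pop()
        if done is None:
            break
        on_complete(*done)
```

Because each ring has exactly one producer and one consumer, the sketch needs no lock, which mirrors the observation above that only the polling thread accesses the queue.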

Comparing Figures 4 and 5, we see that polling bet- 
ter utilizes the CPU for I/O-related work. Additionally, 
components that we didn’t directly optimize (such as the 
VFS layer, for example) are more efficient thanks to better
cache utilization and less cache pollution due to fewer
context switches. 

We performed two additional code optimizations in 
QEMU to reduce latencies, whose impact is already in- 
cluded in the above discussion. When accessing a guest’s 
memory, QEMU must first translate the address using a 
page-table-like data structure. This handles cases where
the guest’s memory can be remapped (for example, when 
dealing with PCI devices). In our case, the memory lay- 
out is static, rendering the translation unnecessary. Re- 
moving unnecessary lookups improved performance by 
4.6% for 4KB reads and 4.2% for 4KB writes. The sec- 
ond optimization is to use a memory pool for internal 
QEMU request structures. This saved 3% for 4KB reads 
and 2.5% for 4KB writes. 
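
The memory-pool idea can be sketched as follows. QEMU is written in C and the source does not describe its pool, so this Python class is only an illustration of the technique; `RequestPool`, `acquire`, and `release` are hypothetical names.

```python
class RequestPool:
    """Reuse request objects instead of allocating a new one per I/O."""
    def __init__(self, factory, size):
        self._factory = factory
        self._free = [factory() for _ in range(size)]  # preallocated objects

    def acquire(self):
        # Fall back to a fresh allocation only when the pool is exhausted.
        return self._free.pop() if self._free else self._factory()

    def release(self, obj):
        # The caller is expected to reset the object before reusing it.
        self._free.append(obj)
```

On the hot path, `acquire` and `release` are a list pop and append, so the allocator is avoided for the common case.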


5.3 Overall Performance Calculation


A storage controller running a new function in a VM that 
uses interrupts for its internal communication would have 
a rather significant performance penalty. Looking at Fig- 
ure 4, the corresponding storage controller implementa- 
tion would look similar, except that the AIO calls would 
be replaced by asynchronous calls to the controller code. 
We consider any work done from the time the applica- 
tion submits the I/O until it reaches the controller to be 
virtualization overhead (work that would not be done if 
running directly on the host). In the unmodified case, the 
overhead is 49µs (we subtract only the latency of the application
layer from the total).

If our techniques were integrated into a controller, we
would calculate the latency overhead as follows, based
on Figure 5. We begin with the total, 15.9µs, and subtract
the application layer, as we did in the previous case.
Further, we subtract the QEMU BDRV layer, and the
AIO system call and completion, because these would
be replaced by the controller code, and are therefore not
considered virtualization overhead. The final overhead is
therefore 7.7µs before the two QEMU optimizations, and
6.6µs after.

To put the overheads in context, we estimate our per- 
formance impact on the fastest latencies published using 
the SPC-1 benchmark since 2009 [34]. The fastest result
was 130 µs, and our virtualization technique would
add approximately 5% overhead to this case (the baseline
case would add approximately 38%). The average of the
27 controllers’ fastest latencies is 482 µs, and in this
average case, our virtualization techniques would add only
1.4% (the baseline would add over 10%).
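
The percentages above follow directly from the reported latencies. As a minimal sketch (the constant and function names are ours, introduced only for illustration), the arithmetic can be reproduced as follows, with all values in microseconds:

```python
# Latency figures quoted in this section, in microseconds.
BASELINE_OVERHEAD_US = 49.0    # unmodified virtualization path
OPTIMIZED_OVERHEAD_US = 6.6    # our techniques, after both QEMU optimizations

SPC1_FASTEST_US = 130.0        # fastest published SPC-1 latency
SPC1_AVERAGE_US = 482.0        # average of the 27 controllers' fastest latencies


def relative_overhead(overhead_us, controller_latency_us):
    """Return the added latency as a percentage of the controller latency."""
    return 100.0 * overhead_us / controller_latency_us


fastest_ours = relative_overhead(OPTIMIZED_OVERHEAD_US, SPC1_FASTEST_US)
fastest_base = relative_overhead(BASELINE_OVERHEAD_US, SPC1_FASTEST_US)
average_ours = relative_overhead(OPTIMIZED_OVERHEAD_US, SPC1_AVERAGE_US)
average_base = relative_overhead(BASELINE_OVERHEAD_US, SPC1_AVERAGE_US)

print(f"fastest controller: ours {fastest_ours:.1f}%, baseline {fastest_base:.1f}%")
print(f"average controller: ours {average_ours:.1f}%, baseline {average_base:.1f}%")
```

The four values round to roughly 5%, 38%, 1.4% and 10%, matching the figures quoted in the text.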

Our improvements affect throughput in addition to
latency. To measure these effects, we ran micro-benchmarks
consisting of multi-threaded 4KB direct I/Os. We improved
read IOPS by a factor of 7.3x (from 48.8K to 357.5K),
and write IOPS by 6.5x (from 43.8K to 284.1K).

Figure 6: File server workload with 6 cores


6 File Server Workload 


We next tested the end-to-end performance of running a 
server in a VM on a storage controller. We ran a file 
server in our VM, and used dbench [36] v4.00 to gen- 
erate 4KB NFS read requests which all arrived at the 
10Gb NIC, went through the local file system and block 
layers, through the para-virtualized block interface, and 
were satisfied by the ramdisk on the host side. We al- 
ways allocated two cores to the controller function, and 
either two or four to the file server (as specified). In the 
virtualized cases, all file server cores were given to the 
VM. All cores were fully utilized for all cases. 

Figure 6 shows the results when running with six 
cores: two for the controller function and four for the file 
server. Bars 1 and 2 show the bare metal case without
polling and with, respectively. Roughly the same perfor- 
mance is attained in both cases. The third bar shows the 
baseline measurement for the guest, which is a signifi- 
cant degradation as compared to the bare metal cases. We 
identified three main causes for this performance drop. 

First, we noticed a large number of page faults on 
the host caused by the running VM. We mitigated this 
using the Linux kernel’s HugePages mechanism, which 
backs a given process with 2MB pages instead of 4KB 
pages. This allows the OS to store fewer TLB page en- 
tries, resulting in fewer TLB faults and fewer EPT table 
lookups. HugePages improved performance by 10.5%, as 
shown in the fourth bar of Figure 6. A feature in a recent 
Linux kernel release makes the use of HugePages auto- 
matic [10]. The second issue affecting performance was 
halt exits, described in Section 4. We avoid these exits by 
setting the guest scheduler to poll. This further improved 
performance by 7.3% (fifth bar in Figure 6). The final 
performance improvement was to add driver polling, for 
both the network and block interfaces (described in Sec- 
tions 4 and 5.2). This further improved performance by 
19.7%, and brings the guest’s performance to be statisti- 
cally indistinguishable from bare metal. 

Figure 7: File server workload with 4 cores

Next, we ran the same workload, but this time allocated
only two cores to the file server (four cores total).
This may be a more common deployment when running
multiple server VMs on a single physical host, for example,
because there are fewer cores available for each VM.
The bare metal results are depicted in Figure 7(a). The
first bar shows the bare metal baseline performance of
442.1 MB/s. We see in the second bar that performance
drops to 331.1 MB/s when using polling. This is because
the host now has only two cores, and the polling thread 
utilizes a disproportionate amount of CPU resources as 
compared to the file server. We remedied this by reducing 
the CPU scheduling priority of the polling thread (bar 3), 
and by setting the CPU affinities of the polling thread 
and some of the file server processes so that they share 
the same core (bar 4). These two changes bring the per- 
formance back to baseline performance. 
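
Both remedies map onto standard Linux scheduling controls. The sketch below is illustrative only and is not the code used in the experiments: it acts on the calling process rather than on a dedicated polling thread, and the function name and the nice increment are our own assumptions. It relies on os.sched_setaffinity, which is available on Linux only.

```python
import os


def pin_and_deprioritize(core, nice_increment=5):
    """Pin the calling process to one core and lower its scheduling priority.

    Returns the new niceness value. Linux-only, because
    os.sched_setaffinity is not available on every platform.
    """
    os.sched_setaffinity(0, {core})   # pid 0 refers to the caller
    return os.nice(nice_increment)    # a larger niceness means lower priority


if __name__ == "__main__":
    # Pick a core this process is already allowed to run on, so the call
    # also works inside containers with a restricted CPU set.
    shared_core = min(os.sched_getaffinity(0))
    new_nice = pin_and_deprioritize(shared_core)
    print(f"polling loop pinned to core {shared_core}, niceness {new_nice}")
```

Pinning some file server processes to the same core with the same call would let them share that core with the polling loop, as described above.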

In the guest case, depicted in Figure 7(b), the baseline 
(bar 1) is approximately 36% lower than the bare metal 
case. Bar 2 includes the HugePages and idle polling op- 
timizations previously described, and bar 3 adds driver 
polling. Similar to the bare metal case, we adjusted the 
polling thread scheduling priority and the affinities of the 
relevant processes (bars 4 and 5). This brings us to results 
that are statistically indistinguishable from bare metal. In 
all cases, tuning was not difficult, and a wide range of 
values provided the achieved performance. 


7 Related Work 


Several works explored the idea of running VMs on stor- 
age controllers. The IBM DS8300 storage controller 
uses logical partitions (LPARs) to enable the creation of 
two fault-isolated and performance-isolated virtual stor- 
age systems on one physical controller [12]. Pivot3 [24] 
and ParaScale [22] are integrated virtualization and scale- 
out SAN storage platforms that are geared to data centers. 
Fido [8] investigated using shared memory to implement 
zero-copy inter-VM communication in Xen in the con- 
text of enterprise-class server appliances. Our focus is 
different in that we investigate external communication, 
zero-copy communication with the controller software, 
and various techniques and methods to reduce overheads 
caused by I/O virtualization. 

Block Mason [21] used building blocks implemented 
in VMs to extend block storage functionality. VMware 
VSA [37] pools the internal storage resources of several
servers in a shared storage pool, using dedicated virtual 
machines running on each server. 


Several works explored off-loading I/O to dedicated 
cores [3,15,16,19]. The closest to ours is VPE [19], 
which adds host-side polling to KVM’s virtio network 
stack. The VPE thread polls the network device for in- 
coming packets and polls the guest device driver for new 
requests. However, the guest incurs exit overheads for 
interrupts and I/O completions since its driver does not 
poll. Dedicating cores for improving I/O performance 
has also been explored in TCP onloading [25, 32, 33]. 


There have been several works that investigated reduc- 
ing interrupt overhead. The Linux kernel uses NAPI to 
disable interrupts of incoming packets as long as there 
are packets to be processed [30,31]. A hybrid approach 
is to use interrupts under low load, and polling when 
more throughput is needed [11]. With interrupt coalesc- 
ing, a single interrupt is generated for a given number of 
events or in a pre-defined time period [2, 27]. A series of 
works compared these techniques qualitatively and quan- 
titatively [28,29]. Rather than polling for fixed intervals 
or according to arrival rates, QAPolling uses the system 
state as determined by applications’ receive queues [9]. 
The Polling Watchdog uses a hardware extension to trig- 
ger interrupts only when polling fails to handle a message 
in a timely manner [20]. 


ELI (ExitLess Interrupts [13]) is a recently-published 
software-only approach for handling interrupts within 
guest virtual machines directly and securely. ELI re- 
moves the host from the interrupt handling paths, thereby 
allowing guests to reach 97%-100% of bare-metal
performance for I/O-intensive workloads.


8 Conclusions and Future Work


We have shown how to use a hypervisor to host and iso- 
late new storage system functions with negligible runtime 
performance overhead. The techniques we demonstrated,
such as polling, dedicated cores, and avoiding page lookups,
are not general purpose, but they are a good fit for our
usage scenario and have a significant payback.


There are several possible extensions. First, ELI [13] 
is a promising new approach for exitless interrupts which 
would remove the need to poll in the guest. We are in- 
vestigating incorporating it into our system. Second, if 
we stay with polling, we can explore ways to better utilize
the polling cores, e.g., to onload the TCP stack
onto a polling core. Third, we can also benchmark these
techniques when running multiple VMs. Finally, we can 
examine how to leverage the fact that we have virtual- 
ized the new storage function’s implementation to take 
advantage of features such as VM migration to improve 
performance and availability. 
Abstract


Good execution of data placement, caching and consis- 
tency policies across a user’s personal devices has always 
been hard. Unpredictable networks, capricious user be- 
havior with leaving devices on or off and non-uniform 
energy-saving policies constantly interfere with the good 
intentions of a storage system’s policies. This paper’s 
contribution is to better manage these inherent uncertain- 
ties. We do so primarily by building a low-power com- 
munication channel that is available even when a device 
is off. This channel is mainly made possible by a novel 
network interface card that is carefully placed under the 
control of storage system protocols. 

The design space can benefit existing placement 
policies (e.g., Cimbiosys [21], Perspective [23], 
Anzere [22]). It also allows for interesting new ones. We 
build a file system called ZZFS around a particular set of 
policies motivated by user studies. Its policies cater to 
users who interact with the file system in an ad hoc way 
— spontaneously and without pre-planning. 


1 Introduction 


Much work has been done in developing appropriate 
data placement, caching and consistency policies in the 
“home/personal/non-enterprise” space (e.g., see [8, 10,
16, 19, 20, 21, 22, 23, 24, 26, 28]). Good policies are 
crucial in maintaining good performance, reliability and 
availability. Unfortunately, there are many barriers that 
make the execution of such policies far from automatic. 
These barriers often stem from the unpredictability of ev- 
eryday life, reflected in variable network resources, de- 
vices being off or dormant at inconvenient times, and 
users’ time and priority given to data management. 
Consider two mundane examples (Section 2 has 
more): In the first example, a busy mom desires to show 
a friend in the mall a photo that happens to be on the 
home computer. That same person might wish to access
her personal medical file (that she does not trust the cloud
for storing) from the beach while on holidays later in the 
week. In all likelihood she will find the above tasks im- 
possible given that her home computer is most likely dor- 
mant or off, and she has not had time to specify any par- 
ticular data replication policy among the computer and 
the smartphone, or hoarded the files beforehand. 

The second example illustrates a consistency problem 
and is taken from Live Mesh’s [14] mailing list. Many 
technology-savvy users experienced frequent conflicts 
with music files. A single user would listen to music on 
device A, then later listen to the same music on device 
B while A was turned off (the files were kept in peer-to- 
peer sync between A and B because the user did not have 
enough space on the cloud to store all files). Because the 
particular music player software updated song metadata 
(like play count and rating), it turns out that this is not a 
read-only workload. As a result, the syncing generated 
conflicts requiring manual resolution whenever the user 
switched devices. It is unfortunate that even in the ab- 
sence of true multi-user concurrency, a single user can 
still get an inconsistent view of the system. 

This paper’s main contribution is to build a low-power,
always-on communication channel that is available even 
when a device is off. The hypothesis is that this channel 
reduces the likelihood that a device is unreachable and 
thus helps the execution of data placement and consis- 
tency policies. We build this channel using new hardware 
and storage system protocols. 

On the hardware front, we incorporate a novel network 
interface card (NIC) in the design of the overall storage 
system (Section 3.1). The NIC maintains device network 
access with negligible energy consumption even when 
the device is dormant. The NIC is able to rapidly turn on
the main device if needed. The ability to turn on the main
device can be thought of as Wake-on-LAN (WoL) [11] “on
steroids,” in that the NIC operates through any firewalls
or NAT boxes, does not need to know the MAC address
of the dormant device, and handles mobility across subnets.
The NIC also exports to our storage system a small
on-board flash storage. While the hardware part of the 
NIC is not a contribution of this paper, we build the stor- 
age system software around it. 

We design the I/O communication channel on top of 
the NIC by leveraging several technical building blocks. 
These are not new individually, but, as we discovered, 
work well together to lead to a usable system. In partic- 
ular, we use data placement protocols based on version 
histories for ensuring consistency (Section 3.3); I/O of- 
floading [15, 29] is used to mask any performance laten- 
cies of turning on a device on a write request by using 
the NIC’s flash storage as a versioned log/journal (Sec- 
tion 3.3); and users get a device-transparent view of the 
namespace with the metadata by default residing on the 
cloud. Metadata can also reside on any device with the 
always-on channel implemented (Section 3.2). 

Fundamentally, our approach makes good use of any 
always-on resources, if available (such as the cloud or 
a home server), but also actively augments the number 
of always-on resources by turning any personal device 
with the new network interface card into an always-on re- 
source. Perhaps subtly, however, it turns out that having a 
few extra always-on resources allows for interesting data 
placement policies that were not possible before. We ex- 
plore these through building a file system called ZZFS. 
We chose to implement a unique set of data placement 
and consistency policies that cater mostly to spontaneous 
users (Section 4). These policies were partially influ- 
enced by qualitative user research. However, other poli- 
cies (e.g., Cimbiosys [21], Perspective [23], Anzere [22]) 
would equally benefit. 


2 Background on the problem 


Users often have access to a set of devices with storage 
capabilities, such as desktops, laptops, tablets, smart- 
phones and data center/cloud storage. Data placement 
policies revolve around deciding which user’s data or 
files go onto which device. Often, a data placement 
policy indicates that the same file should be placed on 
multiple devices (e.g., for better reliability, availability 
and performance from caching). Consistency policies re- 
volve around ways of keeping the multiple file replicas in 
sync so as to provide the abstraction of a single file to users.
We illustrate problems related to the execution of these
policies through three simple examples that reflect policies
taken from some recent related work.

Example 1: Two replicas of a file: This example de- 
fines the terminology and thus is slightly longer than the 
subsequent two. Systems like Perspective [23],
Cimbiosys [21] and Anzere [22] allow a photographer to
say “keep all my photos replicated on my work machine
and tablet.” Imagine a user U and a photo file F. It is
very likely that when U edits F from the work machine,
the tablet is dormant so the changes do not immediately 
propagate to the tablet. Typical implementations of this 
policy make use of a transaction log L that keeps track of 
the changes U makes on the work machine. The log is 
later replayed on the tablet to maintain consistency. 

When the photographer later moves to work on the
tablet, the log will still be on the now-dormant work machine.
Thus, the tablet is not able to replay the log. The
user has two options, neither of which leads to great
satisfaction with the system. Option 1 is for the user to
manually turn on the work machine and wait until all the data
is consistent. This option is implicitly assumed in Perspective,
for example. Option 1 may be out of reach for
non-tech-savvy users who just want to get on with their
work and do not understand they have to wait (“for how
long?”) for consistency to catch up.

Option 2 is to continue working on the stale copy of
F on the tablet, keep a separate transaction log L2 of the
work in the tablet, and then later on, when both machines
happen to be up at the same time, have a way to reconcile
L and L2. In the best case, the copies can be reconciled
automatically (e.g., the user is working on two different
parts of the photo that can be just merged). In the worst
case, manual conflict resolution is required. Option 2
is in fact the only option if there is truly no other way
the devices can communicate with one another (e.g., if
the user is on a plane with the tablet and with no network
connectivity). However, it seems a waste of human
effort that the user has to resort to this option even when
the network bandwidth in many places (e.g., within the
home, or work) would be perfectly adequate for automatic
peer-to-peer sync, if only the devices were on.
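
To make the two outcomes of Option 2 concrete, the following sketch replays and reconciles two transaction logs recorded against the same base copy. It is a deliberately simplified illustration, not the mechanism of any cited system: the Edit record, the notion of a named "region" of the file, and the rule that any region touched by both logs is a conflict are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    region: str   # hypothetical name for the part of the file that changed
    value: str    # new content of that region


def replay(state, log):
    """Apply a log of edits, in order, to a copy of the given state."""
    new_state = dict(state)
    for edit in log:
        new_state[edit.region] = edit.value
    return new_state


def reconcile(base, log_l, log_l2):
    """Merge two logs recorded against the same base copy.

    Returns (merged_state, conflicts). Regions touched by both logs are
    left at their base value and reported for manual resolution.
    """
    conflicts = sorted({e.region for e in log_l} & {e.region for e in log_l2})
    safe_l = [e for e in log_l if e.region not in conflicts]
    safe_l2 = [e for e in log_l2 if e.region not in conflicts]
    return replay(replay(base, safe_l), safe_l2), conflicts


base = {"sky": "original", "face": "original"}
log_l = [Edit("sky", "brightened")]        # edits made on the work machine
log_l2 = [Edit("face", "red-eye removed")]  # edits made later on the tablet
merged, conflicts = reconcile(base, log_l, log_l2)
print(merged, conflicts)
```

Here the two logs touch different regions, so they merge automatically. Had both logs edited "sky", that region would be reported as a conflict and left for the user to resolve.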

Example 2: Device transparency: Several systems
advocate device transparency, where the namespace reflects
one’s files and data, not the device where they reside.
Eyo, for example, allows a user to list from any
device the metadata (e.g., name) of all files residing in
all subscribed devices [26]. We like the idea of the metadata
being always available, but want to help further by
satisfying the user’s data needs as well. Imagine a user
U having the names of all her documents, photos and 
videos, displayed on her tablet. When U meets a friend 
in the mall, she wishes to show her a short video from 
a birthday party. The video happens to physically re- 
side on her home computer (although the metadata is on 
the tablet). There is reasonable 3G bandwidth to stream 
the video, but the home computer is dormant. The user 
knows the video exists, but cannot access it. 

Figure 1: Storage system architecture, basic interfaces and Somniloquy hardware in action.

Example 3: Cloud storage: Having sufficient storage
space to store all user data in the cloud with fast network
connectivity to access it seems technically likely in the
next few years (perhaps sooner in Silicon Valley). However,
any consideration of data placement must include
human factors as well as technology and cost trends. Human
factors include, among others, trust in the cloud and
desire to possess, know and control where one’s data 
is located. Section 4 describes qualitative user studies 
we did in the context of this paper. From those studies, 
we believe that devices will continue to be places where 
users store some of their data. As such, we fully em- 
brace the cloud as another place to store data, and we let 
users ultimately decide how to use that place. We do not 
second-guess them or force them to automatically place 
everything on the cloud. Internally, the system makes 
good use of available cloud space (e.g., for storing meta- 
data — Section 3.2, and versioned logs — Section 3.3). 


On the technical front, our system helps users who 
might have slow network connections to the cloud. Imag- 
ine a scenario in which a user decides to store a substan- 
tial amount of his data on the cloud. A user editing an 
article and compiling code while traveling benefits from 
the device’s cache to batch writes before sending them to 
the cloud. When the user returns home and wants to con- 
tinue working on the data from his home PC, he finds the 
PC’s state is stale and incurs large performance penal- 
ties until the state is refreshed. A good cache placement 
policy would automatically hoard the user’s working set 
to the home cache before the user would need to use it. 
Such a policy is hampered, however, because the home 
PC is likely dormant before the user arrives. 


Intuition on how this paper helps: This paper is 
about enabling a satisfying execution of a user’s data 
placement, caching and consistency policies given the 
likelihood that devices they rely on are dormant. One 
way our system will help the situation in Example 1 is 
by allowing peer-to-peer sync policies to work by turn- 
ing devices on and off rapidly and automatically. If peer- 
to-peer sync would not be advisable (e.g., because of bat- 
tery considerations), the system temporarily offloads the 
transaction log L onto the cloud. In Example 2, the system
will continue to present a device-transparent view of
metadata, and will rapidly turn on the home computer to
get the data to the user. In Example 3, either peer-to-peer
or cloud-to-device cache syncing will be enabled by
turning on the devices whose caches need refreshing.


3 Design 


Figure 1 shows several building blocks of the storage
system. First, storage-capable devices strive to always 
maintain a low-power communication channel through a 
new low-power network card. Second, a metadata ser- 
vice maintains a unified namespace, encompassing any 
available storage space on devices, cloud and any home 
servers. Third, an I/O director, in cooperation with the 
metadata service and the new communication channel, 
manages the I/O flow through the system. 


3.1 


Data placement and consistency protocols are helped if 
devices maintain an always-on communication channel, 
even when dormant or off. Of course, such a channel 
should consume minimal power. We chose to use a new 
network interface card, called Somniloquy, that is de- 
signed to support operation of network-facing services 
while a device is dormant. Figure 1 shows it operating
with one of our desktops. Somniloquy was first described 
by Agrawal et al. [2] in the context of reducing PC energy 
usage. The hardware is not a contribution of this paper. 
This paper reports on Somniloquy’s role and integration 
into a distributed personal storage system. 

Somniloquy consumes between one and two orders of
magnitude less power than a PC in idle state. Somniloquy
exports a 5 Mbps Ethernet or Wireless interface (Figure 1
shows a prototype with the Ethernet interface) and
a few GB of flash storage. Somniloquy runs an embedded
distribution of Linux on a low-power 400 MHz XScale
processor. The embedded OS supports a full TCP/IP
stack, as well as DHCP and serial port communication.
Power consumption ranges from 290 mW for an idle
wireless interface, through 1073 mW for the idle Ethernet
interface, to 1675 mW when writing to flash [2].

Somniloquy allows a dormant device to remain re- 
sponsive to the network. The NIC can continue to com- 
municate using the same IP address as the dormant de- 
vice. Somniloquy is more appropriate than Wake-on- 
LAN (WoL) [11] for mobile storage devices, because it 
operates through firewalls and NAT boxes, and it han- 
dles mobility across subnets. The on-board processor 
maintains contact with a DNS server to preserve the 
hostname-to-IP address mapping, performs basic net- 
working tasks, and does I/O to its local flash card. 
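
For contrast, a conventional WoL "magic packet" is built from the target's MAC address, which is why a WoL sender must know that address in advance. The sketch below constructs such a packet (six 0xFF bytes followed by sixteen repetitions of the 6-byte MAC); the function name and the example address are ours. The packet is commonly sent as a link-layer or UDP broadcast, which is also why it does not ordinarily cross subnets or NAT boxes.

```python
def wol_magic_packet(mac):
    """Build a Wake-on-LAN magic packet for a MAC such as '00:11:22:33:44:55'."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Synchronization stream of six 0xFF bytes, then the MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


packet = wol_magic_packet("00:11:22:33:44:55")   # example address only
print(len(packet))
```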

Does the new NIC make the overall system less se- 
cure? Our experience is incomplete. Logically, the sys- 
tem is running the same storage service as before. How- 
ever, because parts of that service now run on the NIC’s 
processor, the attack surface on the system as a whole 
has increased. Also, while modern processors have ad- 
ditional security features such as execute-disable bits to 
prevent buffer overflows, our low power processor does 
not support these features yet. Denial-of-service attacks 
might result in drained batteries. To partially mitigate 
these problems we force the NIC to only listen on one 
port (5124) that belongs to the storage service. Further, 
we require the main device and the NIC’s processor to be 
on the same administrative domain. 

Somniloquy is the hardware part of the solution, but it 
is insufficient without the storage and file system soft- 
ware. Here we give intuition on how the I/O direc- 
tor (Section 3.3) will use Somniloquy for two common 
operations: reads and writes. A read to a file on a 
Somniloquy-enabled storage device incurs a worst-case 
latency when the request arrives just as the device is go- 
ing into standby. Somniloquy will wake up the device 
and the latency is at least standby + resume time. Table 1 
shows some measurements to understand this worst-case 
penalty. Future devices are likely to have faster standby 
and resume times. Writes do not have a similar latency 
penalty. The I/O director can temporarily offload data to 
Somniloquy’s flash card, or nearby storage-capable re- 
sources (such as the cloud) if these are available. 
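
As a worked example of the worst-case read penalty, the sketch below adds the standby and resume times for three of the devices measured in Table 1. The dictionary and function names are ours, and the result is only the lower bound stated above (standby plus resume time); it ignores network and request-processing time.

```python
# (standby seconds, resume seconds) as reported in Table 1.
TABLE1_TIMES = {
    "Lenovo x61 (Win7)": (3.8, 2.6),
    "HP Pavillon (XP)": (4.9, 10.25),
    "Ubuntu 11.10": (11.0, 4.5),
}


def worst_case_read_penalty(standby_s, resume_s):
    """Lower bound on read latency when a request arrives as standby begins."""
    return standby_s + resume_s


penalties = {
    device: round(worst_case_read_penalty(standby, resume), 2)
    for device, (standby, resume) in TABLE1_TIMES.items()
}
for device, seconds in penalties.items():
    print(f"{device}: at least {seconds} s")
```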

Device                      Standby (s)   Resume (s)
Lenovo x61 (Win7)           3.8           2.6
Dell T3500 (Win7)           8.7           Ta
HP Pavillon (XP)            4.9           10.25
Macbook Pro (OSX 10.6.8)    1             2
Ubuntu 11.10                11            4.5

Table 1: Example suspend and resume times for commodity
devices. The device is first rebooted to clear previous
state, then it is put into standby followed by a resume.
Section 5.2 shows more realistic end-to-end measurements
using the Dell T3500 device.

Summary, limitations and alternatives: We design
to allow devices to maintain network awareness even
when dormant. Our specific way of enabling the goal is
to introduce new NIC hardware to each device. Agrawal
et al. [2] describe why Somniloquy is more appropriate
than several other hardware-based alternatives (e.g.,
Turducken [25]) and we do not list those alternatives further
here. An assumption we make is that it is cost effective
to augment devices with a smarter network interface
card. Further, we assume the NIC would not drastically
change the failure characteristics of the device. These
assumptions might turn out to be a limitation of our approach,
depending on the economics of producing a device
and its failure characteristics. Another limitation is
a lack of evaluation of Somniloquy with tablets or smart- 
phones. Currently the driver works for Windows Vista/7 
only, which limits the experiments in Section 5 to lap- 
tops and desktops. Currently, the NIC can only wake up 
devices that are placed into standby, and are not fully off. 
A software-based alternative would be to maintain de- 
vice network awareness by encapsulating a device in a 
virtual machine abstraction and then making sure the vir- 
tual machine (VM) is always accessible. SleepServer, 
for example, migrates a device’s VMs to an always-on 
server before the physical device goes dormant [3]. This 
alternative might be more appropriate in enterprise envi- 
ronments where VMs are used and dedicated always-on 
servers are available, rather than for personal devices. 


3.2 Metadata service 


The metadata service (MDS) maintains a mapping among an
object/file ID, the devices that object is stored on, and
the replication policy used. The MDS uses a flat object-based
API by default, where each object ID is an opaque
128-bit string. The MDS is a logical component,
and it can reside on any device or server.
The metadata service might be replicated for availabil- 
ity. Consensus among replicated services could be main- 
tained through the Paxos protocol [22]. Furthermore, the 
data belonging to the service might be replicated for reli- 
ability, or cached on devices for performance. Data con- 
sistency needs to be maintained across the replicas. 
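
The mapping kept by the metadata service can be pictured with a small in-memory sketch. This is not the ZZFS implementation: the class and method names, the policy label, and the use of random bytes for object IDs are assumptions made for illustration. It shows only the core idea of a flat table from an opaque 128-bit ID to the set of devices holding the object and its replication policy.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class ObjectRecord:
    devices: set = field(default_factory=set)   # devices holding a replica
    policy: str = "replicate-all"               # hypothetical policy label


class MetadataService:
    def __init__(self):
        self.registered = set()   # devices known to the service
        self.objects = {}         # 16-byte object ID -> ObjectRecord

    def register_device(self, device):
        self.registered.add(device)

    def create_object(self, devices, policy):
        unknown = set(devices) - self.registered
        if unknown:
            raise ValueError(f"unregistered devices: {sorted(unknown)}")
        object_id = secrets.token_bytes(16)   # opaque 128-bit identifier
        self.objects[object_id] = ObjectRecord(set(devices), policy)
        return object_id

    def lookup(self, object_id):
        return self.objects[object_id]


mds = MetadataService()
mds.register_device("laptop")
mds.register_device("tablet")
photo_id = mds.create_object(["laptop", "tablet"], "replicate-all")
print(photo_id.hex(), mds.lookup(photo_id))
```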

The low-power communication channel in Section 3.1 
helps with MDS availability and reliability in the fol- 
lowing way. If the service is replicated among devices 
for availability, Somniloquy wakes up dormant devices 
that need to participate in the consensus protocol. If the 
data belonging to the MDS is replicated, the I/O director 
strives to maintain strong consistency through a range of 
techniques described in Section 3.3. A reasonable default
for home users is to have a single instance of the
metadata service run on a cloud server with content replication
factor of 1, i.e., instead of being replicated, the
metadata content is cached on all devices (this is what
our file system implementation in Section 4 does). The
metadata content can be cached on all devices since its
size is usually small (Section 5.5).


The client library caches a file’s metadata when a file 
is created and pulls metadata updates from the metadata 
service when accesses to a file fail. The latter could
happen because the file has been moved or deleted,
the access control policy denies access, or the
device has failed. A client’s library synchronously updates
the MDS when metadata changes. Those updates
could be subsequently pushed by the metadata service to 
other devices caching the metadata (the push could be 
lazy, e.g., daily, or could happen as soon as the change 
occurs). For the common case when a device is dormant, 
Somniloquy could wake up the device (or absorb the 
writes in its flash card temporarily) to update its cache. 
A client might choose to pull the latest metadata explic- 
itly (e.g., through a Refresh button), rather than using the 
push model. While the design supports both models, we 
believe a hybrid pull and lazy push model is a reasonable 
default for home users. 


Our design requires storage devices to be explicitly 
registered with the MDS. If a device is removed from 
the system, either because it has permanently failed or 
because a newer device has been bought that replaces it, 
a user needs to explicitly de-register the old device and 
register the new device with the MDS. The metadata ser- 
vice initiates daily heartbeats to user devices to detect 
permanent failures and to lazily refresh a device’s meta- 
data cache. A heartbeat wakes up a dormant device. A 
device is automatically rebuilt after the user triggers the 
rebuild process. 


Summary, limitations and alternatives: The novel 
aspect of our metadata service is that the execution of 
both metadata service consensus (for availability) and 
metadata replication consistency protocols (for reliabil- 
ity and performance through caching) is helped by the 
ability to turn participating devices on and off transpar- 
ently. The design allows for several consensus and con- 
sistency options. However, by default the MDS resides 
on the cloud and its content is cached on all devices. The 
implicit assumption for this default is that the user will
have at least broadband connectivity (>56 Kbps) at home
or work and some weak 3G connectivity when mobile.
Further, we assume a few hundred MB of storage
space at a cloud provider. We believe this is a weak
assumption, but, even in the absence of cloud space, the
metadata service and data could still reside on any de- 
vice that incorporates Somniloquy. 


3.3 I/O director


The I/O director is the third building block of our design. 
Its goal is to be versatile, allowing for a range of data 
placement and consistency policies. Uniquely to our sys- 
tem, the I/O director has new options for data movement. 
It can choose either to wake up a device to make reads or 
writes, or to temporarily use the flash storage provided by 
Somniloquy; it can also opportunistically use other stor- 
age resources to mask performance latencies and main- 
tain the always-on communication channel. 


The operations of the I/O director are best understood 
through Figure 2, which shows a client, a metadata service
(MDS) and two devices D1 and D2. The data is
replicated on both devices with a particular primary- 
based concurrency control mechanism to serialize con- 
current requests. In this example, each replicated file has 
one replica that is assigned the primary role. Figure 2 
shows some common paths for read and write requests. 
Reads, in the default case, are serviced by the primary for 
an object, as seen in Figure 2(a). When all devices are 
dormant and a read or write request arrives, Somniloquy 
resumes the device and hands it the request as shown in 
Figure 2(b) for reads and Figure 2(d) for writes, respec- 
tively. Writes are sent to the primary, which serializes 
them to the other replicas of the object as in Figure 2(d). 
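The read and write paths above can be sketched as follows. This is a minimal illustration, not the ZZFS implementation: the `Device` class, its `wake()` method (standing in for Somniloquy resuming a dormant device) and the in-memory `store` are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical model of a storage device that may be dormant."""
    name: str
    on: bool = True
    store: dict = field(default_factory=dict)
    wakeups: int = 0

    def wake(self) -> None:
        """Stand-in for Somniloquy resuming a dormant device."""
        if not self.on:
            self.on = True
            self.wakeups += 1


def read(primary: Device, key: str):
    """Reads are served by the primary; a dormant primary is woken first."""
    primary.wake()
    return primary.store.get(key)


def write(primary: Device, replicas: list, key: str, value: bytes) -> None:
    """The primary applies the write, then forwards it to the other replicas."""
    primary.wake()
    primary.store[key] = value
    for replica in replicas:
        replica.wake()
        replica.store[key] = value
```

Because every write passes through the primary, concurrent writers see a single order; the sketch assumes replicas are woken rather than bypassed through offloading.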


When objects are replicated and a device goes into a 
controlled standby, the metadata service receives an RPC 
indicating that, as seen in Figure 2(c). This is an op- 
timization to give the MDS the chance to proactively 
assign the primary role away from that device to de- 
vices that are on. As might be expected, transferring the 
primary role does not involve data movement, just net- 
work RPCs to inform devices of the new assignment. A 
client’s metadata cache might be stale with the old pri- 
mary information, so a read will initially go to the dor- 
mant device. However, the device is not turned on, since 
the primary does not reside there. Instead, the client 
times out, which triggers an MDS lookup and cache re- 
fresh. The read then proceeds to the device with the pri- 
mary, which happens to be on in this example. 
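The stale-cache case can be illustrated with a short sketch. The dictionaries used for devices, the client cache and the MDS table are assumptions made for this example; a timeout is modeled simply as `try_read` returning `None`.

```python
def try_read(device: dict, obj: str):
    """Return the object's data, or None to model a timeout.

    A dormant device that no longer holds the primary is not woken,
    so the client's request goes unanswered.
    """
    if not device["on"]:
        return None
    return device["data"].get(obj)


def read(obj: str, cache: dict, mds_primary: dict, devices: dict):
    """Read via the cached primary; on timeout, refresh from the MDS and retry."""
    data = try_read(devices[cache[obj]], obj)
    if data is None:
        # Timeout: look up the current primary at the MDS and refresh the cache.
        cache[obj] = mds_primary[obj]
        data = try_read(devices[cache[obj]], obj)
    return data
```

The client pays one timeout only when its cached primary assignment is stale; later reads go straight to the new primary.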


The I/O director implements I/O offloading [15, 29] to 
mask large write latencies and to implement the logging 
subsystem. The logging subsystem gives the abstraction 
of a single virtual log to the whole distributed storage 
system. The actual log might reside on any storage de- 
vice. Its size is limited by the size of cloud space, plus 
NIC flash space, plus all free hard drive space across all 
devices. Figure 2(e) shows offloading to the log (Sec- 
tion 5 evaluates the case when the log physically resides 
on a nearby device). Remember that if parts of the log 
are on the dormant device’s hard drive, that device can be 
woken up as needed to access the log. Data is eventually 
reclaimed at the expected device at appropriate times,
e.g., when the device is not in use.


Figure 2: Read and write protocols and common cases. D stands for device and p indicates that the primary for the
file being accessed is on that device. “Off” and “on” indicate whether the device is dormant or not.


The system is optimized for the common case when 
there is some network connectivity among devices and
the cloud. If that is not the case, e.g., when the user is 
on a plane without network access, the system will tem- 
porarily offload all user writes to the log, and the log will 
have to physically reside with the user’s device locally. 
When the user gains network connectivity, all partici- 
pating devices will have to eventually reclaim data from 
the log and do standard conflict resolution (e.g., as in 
Bayou [28]), as illustrated in Figure 2(f). Our work does 
not add anything novel to this scenario’s logic, but our 
implementation makes use of the existing logging infras- 
tructure to keep track of write versions. 


A user can move a file to a new device, and can
change its replication policy, at any time. When either of
these happens, our system allows continuous access to
the file. Any new writes to the file are offloaded to the 
versioned log. The I/O director logic maintains the nec- 
essary bookkeeping to identify the location of the latest 
version of a file. The location could be the old location, 
or the log, depending on whether the file has seen any 
new writes while being transferred or not. Once the file 
has moved to the new location, reclaim is triggered to 
copy any bytes that might have changed. 
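A minimal sketch of this bookkeeping, assuming byte-range writes that fall within the file, is shown below. The `MovingFile` class and its method names are hypothetical and only illustrate how reads can combine the old location with the versioned log until reclaim completes.

```python
class MovingFile:
    """Hypothetical model of a file that stays accessible while it is moved."""

    def __init__(self, data: bytes):
        self.old = bytearray(data)  # contents at the old location
        self.new = None             # contents at the new location, once copied
        self.log = []               # offloaded writes as (offset, payload), in order

    def write(self, offset: int, payload: bytes) -> None:
        """New writes are offloaded to the log instead of blocking on the move."""
        self.log.append((offset, bytes(payload)))

    def read(self, offset: int, length: int) -> bytes:
        """The latest version is the base copy overlaid with the logged writes."""
        base = self.new if self.new is not None else self.old
        view = bytearray(base)
        for off, payload in self.log:
            view[off:off + len(payload)] = payload
        return bytes(view[offset:offset + length])

    def finish_move(self) -> None:
        """Complete the copy, then reclaim the bytes that changed meanwhile."""
        self.new = bytearray(self.old)
        for off, payload in self.log:
            self.new[off:off + len(payload)] = payload
        self.log.clear()
```

Reads never block: before the move finishes they are served from the old copy plus the log, and afterwards from the new copy.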


Summary, limitations and alternatives: The novel 
aspect of the I/O director is that it has new options 
for data movement. It can also choose to turn on a 
dormant device. The I/O director is optimized for an 
increasingly-common case of at least basic network con- 
nectivity among storage devices. It reverts to well-known 
conflict resolution techniques otherwise. 


We currently use I/O offloading techniques [15, 29] 
to augment the base file system (which is not versioned) 
with a versioned file system partition. Ursa Minor’s techniques
for data placement versatility [1] are a good alternative
in case the underlying file system is already
versioned. For example, Ursa Minor uses backpointers 
when changing data replication while maintaining data 
availability. Also, advanced data encoding policies (e.g., 
the use of erasure codes), and other concurrency control 
methods (e.g., based on quorums) could equally benefit 
from our always-on communication channel. 


3.4 Interaction with energy policies 


As remarked above, Somniloquy consumes more than an 
order of magnitude less energy than an idle device while 
maintaining network awareness. The default interaction 
with energy policies is simple. A read overrides the en- 
ergy policy and wakes up the device. Writes are fully 
buffered in the NIC’s flash card or the cloud before waking up the
device. These policies are similar to the ones offered by 
BlueFS [16], in that they actively engineer and divert the 
traffic to the right device, but we have more resources 
available, in the form of the NIC’s flash card or cloud. 


Because the NIC runs a capable operating system, 
more complex energy policies can be encoded as part of 
the NIC processing. For example, BlueFS reduces en- 
ergy usage by reading data from the device that will use 
the least amount of energy. That policy could be slightly 
modified to take into account the device turn on time, if 
the device is dormant. Furthermore, the storage system 
could determine whether to wake up a device or not as 
a function of whether the device is plugged in or run- 
ning on batteries. Also, a more advanced standby strat- 
egy might predict future access patterns and prevent the 
computer from going into standby. Currently, our de- 
vices use simple idle time-based policies, like the ones 
implemented on Windows. 
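As an illustration of the kind of policy that could run on the NIC, the sketch below picks the read source with the lowest energy cost, charging a dormant device its turn-on energy and avoiding waking devices that run on batteries. All field names and the cost model are assumptions made for this example; they are not taken from BlueFS or ZZFS.

```python
def read_energy_cost(device: dict) -> float:
    """Energy (in joules) to serve a read, including any turn-on cost."""
    cost = device["read_energy_j"]
    if device["dormant"]:
        cost += device["wake_energy_j"]
    return cost


def pick_read_device(devices: list) -> str:
    """Return the name of the cheapest device to read from.

    Dormant devices on battery are skipped unless no other replica exists.
    """
    allowed = [d for d in devices if not (d["dormant"] and d["on_battery"])]
    pool = allowed if allowed else devices
    return min(pool, key=read_energy_cost)["name"]
```

With such a policy, a dormant desktop is woken only when the energy saved on the read itself outweighs its turn-on cost.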


4 ZZFS: a file system artifact


Perhaps surprisingly, having a few extra always-on re- 
sources allows for interesting data placement policies 
that were not possible before. We explore these through 
building a file system called ZZFS. We chose to imple- 
ment a unique set of data placement and consistency poli- 
cies that cater mostly to spontaneous data accesses. 


4.1 Design rationale 


The design rationale for ZZFS is indirectly influenced by 
data from qualitative user research, comments on mailing 
lists of popular sync tools like Live Mesh [14] and Drop- 
box [4], and our desire to explore new policies. ZZFS’s 
policies are different from those of, say, Cimbiosys [21]
or Perspective [23], but not necessarily “better” or appro- 
priate in all cases. 

Data from sync programs: To understand how users
perceive consistency and conflict problems, and how they
prioritize fixing them relative to performance
problems, we collected and analyzed user feedback for
Live Mesh [14] and Dropbox [4], two popular sync
tools. They serve as a rather coarse proxy for understanding
consistency in the absence of a distributed file system.
Feedback from the sync programs is heavily biased
toward early adopters and technology experts, of course, 
but it is nevertheless helpful if only because of its volume 
(thousands of messages on public forum boards). Exam- 
ple 1 in Section 2 was influenced by this data. 

Qualitative studies: Our first qualitative study helped 
us understand how people understand, organize, store, 
and access their content across different devices. The 
users for the qualitative studies were picked at random 
by a third-party company that specializes in user stud- 
ies. We performed “guerrilla” (street) interviews with six 
people. We visited two family homes and we then invited 
two different families to a conference room (provided by 
the third-party company so that our identities would re- 
main unknown to avoid perception bias) to further dis- 
cuss concepts through storyboards. The raw data we col- 
lected is available upon request, but we have not put it 
in paper form yet. In parallel, we conducted a second, 
larger-scale study on issues around data possession [17]. 

How the data influenced us: This research influenced 
us to try harder to cater to the character of data access 
and device management displayed by ordinary (i.e., non-technical)
users. We interpret the data as suggesting that
syncing and replication policies are compromised by the 
ways users store data, their ad hoc access of networks, 
and the priority given to social and economic matters of 
data management. 

By default, ZZFS caters to spontaneous users, with no
data placement policies specified at all. No
user effort is required to pre-organize data on devices (by
hoarding, syncing, etc.). Data by default remains on the
device where the user first chose to create it, with a replication
factor of 1. Users showed greater concern and doubt about
transferring data between devices than about
device failure. This could be interpreted as similar to
Marshall’s observation that only 5% of data loss is due 
to a device failure [12]. 

Whenever a file needs to be accessed, the device it is 
on is asked to provide access to that file. If the device is 
dormant, the device is woken up through the I/O direc- 
tor and the network-aware part of the device. For more 
advanced users who worry more about device failure and 
thus specify a higher replication factor for files, ZZFS 
strives to reduce the time it takes to reach consistency 
among replicas by data offloading and by waking up de- 
vices as described in Section 3.3. 

We found that users made deliberate and intelligent 
decisions about wanting to silo their data on different de- 
vices and the cloud. From both user studies, we believe 
that devices will continue to be places where users store 
their data. Any consideration of data placement must 
consider human factors as well as technology and cost 
trends. Human factors include, among others, trust in 
the cloud and desire to possess, know and control where 
one’s data is located. Furthermore, different devices 
have unique affordances [6] and properties (e.g., screen 
size, capacity, weight, security, price, performance, etc.). 
Users seem capable of understanding those affordances, 
and ZZFS does not second-guess them. Data movement is incurred
only when a user explicitly chooses to do so.


4.2 Implementation details and status 


We have implemented most of the design space described 
in Section 3. ZZFS is a distributed file system that re- 
sults from picking a set of policies. It is implemented 
at user-level in C. ZZFS supports devices whose local 
file system can be NTFS or FAT. ZZFS has implemented 
per-object replication and allows for in-place overwrites 
of arbitrary byte ranges within an object. Concurrent ac- 
cesses to a file are serialized through a primary. ZZFS’s 
namespace is flat and has no folders; however, it
maintains collections of files through a relate() call.

The current implementation addresses a limited set of 
security concerns. Data and network RPCs can be encrypted
(but are not by default). Each object has an access
control list that specifies which user can access that
object and from what device. We are actively researching
what security means for home users [13].

In addition to simple benchmarks that directly access 
ZZFS through a client library, we run unmodified, legacy 
applications (e.g., MS Office, iTunes, Notepad, etc.) for 
demoing and real usage. We do so by mounting ZZFS
as a block device through the use of a WebDav service [31].
This technique required us to detour the WebDav
service to use our APIs [9]. WebDav file semantics
are different from NTFS semantics and often lead
to performance inefficiencies (e.g., any time a change is 
made to a file, WebDav forces the whole file to be sent 
through the network). The following calls are detoured 
to use ZZFS’s calls: CreateFile(), FindFirstFile(), Find- 
NextFile(), ReadFile(), WriteFile(), GetFileAttributes() 
and DeleteFile(). The interface is currently Windows Explorer.
A more appropriate interface for a distributed file
system is a work in progress.

ZZFS is robust. We are using it daily as a secondary 
partition to store non-critical files. When it crashes, it 
usually does so because of the NIC’s device driver. The 
driver issues will be resolved over time and were not our 
primary focus for this paper. However, we are working 
toward having ZZFS as a primary partition for all files. 


5 Evaluation 


First, we measure how ZZFS performs and locate its 
bottlenecks. Second, through a series of real scenarios, 
we measure latencies and penalties associated with the 
always-on communication channel. This is an evalua- 
tion of the underlying storage system and also of ZZFS’s 
policies. Third, we provide analytical bounds for performance
for a range of workload and device characteristics.
Fourth, we examine metadata scalability. 


5.1 Exposing throughput bottlenecks 


This section focuses on throughput. The other sections 
will focus on latency. We compare our system against 
local file access through the NTFS file system. This is 
the only time we will use a set of homogeneous devices 
(obviously not realistic for personal devices), because it 
is simpler for revealing certain types of bottlenecks. The 
devices are three HP servers, each with a dual-core Intel 
Xeon 3 GHz processor and 1 GB of RAM. The disk in 
each device is a 15 KRPM Seagate Cheetah SCSI disk. 
The devices have a 1 Gbps NIC. All reads and writes are 
unbuffered, i.e., we do not make use of the RAM.

First, we measure peak bandwidth and IOPS (I/Os per 
second) from a single device (“Read.1” and “Write.1” in 
Figure 3). Bandwidth is measured in MB/s using 64 KB 
sequential reads and writes to a preallocated 2 GB file. 
IOPS are measured by sending 10,000 random-access 
4 KB IOs to the device with 64 requests outstanding at a 
time. Figure 3 shows the average from 5 results (the vari- 
ance is negligible). Average local streaming NTFS per- 
formance (not shown in graph) is 85 MB/s for reads and 
writes and 390 IOPS for reads and 270 IOPS for writes; 
hence, ZZFS adds less than 8% overhead. 


FAST ’12: 10th USENIX Conference on File and Storage Technologies











[Figure 3 plot omitted: bar chart of bandwidth (BW) and IOPS for Read.1, Read.max, Write.1 and Write.max.]


Figure 3: Baseline bandwidth and IOPS.


Second, we measure maximum bandwidth and IOPS 
from all three devices to understand performance scalability
(“Read.max” and “Write.max” in Figure 3). Three
clients pick one random 2 GB file to read or write to, out 
of a total of 10 available files. Each file is replicated 3- 
way. If all clients pick the same file, accesses still go to 
disk since buffering is disabled. Figure 3 shows the re- 
sults. As expected from 3-way replication, the saturated 
write bandwidth is similar to the bandwidth from a sin- 
gle device. Saturated read bandwidth is about a third of 
the ideal because requests from all three clients interfere 
with one another. This problem exists in many storage 
systems because of a lack of performance isolation [30]. 
Saturated IOPS from all devices is close to the ideal of 
3x the IOPS from a single device. 

Overall, these results show that our system performs 
reasonably well with respect to throughput. Optimiza- 
tions are still required, however, especially with respect 
to reducing CPU utilization. CPU utilization in the sat- 
urated cases was close to 100%, mostly due to unneces- 
sary memory copies. 


5.2 I/O director 


This section focuses on read and write latencies resulting 
from the always-on channel. We have real measurements 
from a home wireless network. We start by illustrating 
the performance asymmetry between reads and writes. 
The first workload is an I/O trace replay mimicking a 
user listening to music. We use trace replay to just focus 
on I/O latencies and skip the time when the user listens 
to music and no I/O activity is incurred. Half of the mu- 
sic is on his laptop, half on the desktop and the setting is 
in “shuffle mode” (i.e., uniform distribution of accesses
to files). The music files are not replicated. The desktop 
(Dell T3500 in Table 1) is on a 100 Mbps LAN and the 
laptop (Lenovo x61) is on a 56 Mbps wireless LAN. The 
Somniloquy NIC is attached to the desktop. The MDS
resides on the desktop, but all metadata is fully cached
on the laptop as well. The music program issues 64 KB
reads to completely read a music file, then, after the user
has finished listening, a database is updated with a small
write of 4 KB containing ratings and play count updates.
The database resides on the desktop and is not replicated.
Hence, although this is a common workload, it is quite
complex and has both reads and writes. The user simply
wants to listen to music without worrying where the
music files and database are located.


Figure 4: A scatter plot over time for a client’s requests.
Read latencies are [mean=0.09 s, 99th=0.36 s,
worst=23.3 s]. Write latencies are [mean=0.014 s,
99th=0.022 s, worst=0.058 s]. There are several performance
“bands” for local reads (0.001-0.01 s), remote
writes (0.05-0.1 s) and remote reads (0.05-1 s).


Figure 4 shows a scatter plot (and latency distribution 
in the caption) of the worst-case scenario when a request to
read a music file comes just as the desktop is starting to 
go into standby. Somniloquy intercepts the read request 
and signals the computer to wake up. The time it takes 
the computer to accept the request is 23.3 s (standby time 
+ resume time) and is illustrated in the scatter plot in 
the figure. In practice, prefetching the next song would 
be sufficient not to notice any blocking; however, when 
prefetching is not possible, this serves as a worst-case 
illustration. We note that the desktop is rather old, and if 
using a newer device (e.g., the MacBook Pro in Table 1),
the worst-case latency would be around 4 seconds.


Figure 5 illustrates that writes do not suffer from this 
worst case scenario. The workload in this scenario is 
a trace replay of I/O activity mimicking a user sending 
64 KB writes to a document from the laptop. The user 
uses 2-way replication for those files, with the second 
replica kept on the desktop. Both laptop and desktop are 
on the wired LAN. Similar to the previous case, the desk- 
top has gone abruptly into standby. However, there is a 
second laptop nearby that is on, and the I/O director temporarily
offloads the writes onto that laptop (other options
for the offload location are Somniloquy’s flash card
or the cloud). This way, 2-way, synchronous replication
is always maintained. When the desktop comes out of
standby, the data on the second laptop (the third device) is
reclaimed. Reclaim does not lead to a noticeable latency increase. The
figure shows a slight increase in latency during data offload
since the second laptop is on the wireless LAN. A
handful of requests experience high latencies throughout
the experiment. We believe these are due to the performance
of the wireless router. Note that writes in this
experiment are slower than in Figure 4 because of larger
write request sizes (64 KB vs. 4 KB) and 2-way replication
vs. no replication.


Figure 5: A scatter plot over time for a client’s write requests.
O.start annotates the time the second device enters
standby, and thus offloading begins to a third device.
R.start annotates the time when the second device resumes
and reclaim starts (offloading thus ends). R.end
annotates the time when all data has been reclaimed.
Write latencies are [mean=0.1 s, 99th=1 s, worst=1.5 s].

We compare our system against simple ping and average
disk latencies, i.e., we set a relatively high bar to
compare against. We measured a minimum of 0.06 s ping
latency for 64 KB¹ and 0.005 s for 4 KB sizes, and the disk’s
average latency is 0.015 s (these are slow SATA disks,
not the fast SCSI disks used in the previous section).
Hence, an end-to-end read request (and ack) should take
on average 0.06 + 0.015 = 0.075 s and an end-to-end write
request (and ack) should take on average 0.005 + 0.015 = 0.02 s².
Looking at the performance “bands” in Figure 4, we see that
local read latency and remote write latency are very good, while
remote read latency is 33% slower than ideal. We have
started collecting detailed performance profiles, but we
note that the delay is unnoticeable to the applications.


¹ Exact size is 65500 B, the maximum ping size.

² Although read requests are sequential, the disk head incurs at least
a full disk rotation before receiving the next request, since the requests
are sent one at a time. Also, observed average disk latencies are
sometimes better than the above theoretical value because our disk is
not full and the files are on its outer tracks.


Figure 6: A scatter plot over time showing the effects
of moving a file on concurrent operations on
that file. Without offloading, the concurrent workload
blocks; with offloading, the concurrent workload
makes progress. When offloading, read latencies are
[mean=0.7 s, 99th=8.6 s, worst=17 s]. Write latencies
are [mean=0.5 s, 99th=7.3 s, worst=14 s].

File move: The next experiment demonstrates how 
moving an object affects performance of concurrent op- 
erations on that object. As discussed in Section 3.3, in- 
stead of locking the file for the duration of the move, 
the I/O director offloads any new writes to the file while 
the copy is in progress. In this experiment, we move a 
1 GB file from one device to another while simultane- 
ously running a series of 64 KB read and write (with 
a 1:1 read:write ratio) operations on that object. Fig- 
ure 6 shows that, with offloading turned off, the read 
and write operations must block until the data move is 
complete; with offloading turned on and another laptop 
temporarily absorbing new writes, these operations make 
progress. The devices are limited by the 56 Mbps wire- 
less LAN, and the network is saturated during the file 
move, hence access performance during that time is slow 
(around 10 s). We believe this is better than blocking for
more than 400 s (the latency of “blocked request” in the 
figure). Note that after the move completes, performance 
improves because the client is co-located with the device 
the file is moved onto. 


5.3 ZZFS’s placement policy


Next, we measure ZZFS’s performance in a 3G city-wide 
network and an intercontinental network. We look at the 
performance resulting from the simple policy of leaving 
data on the device where it was first created. We illustrate the
performance of our system when a user on the move is 
accessing music files stored on the home desktop. Unlike
the music scenario above, the client has no music files
or metadata cached on the laptop and always reads and
writes to the home desktop. Access sizes are the same as
before (64 KB reads and 4 KB writes).


Figure 7: A CDF plot for a client’s read and write requests
over a 3G city-wide network and intercontinental
network. For the 3G network, read latencies are
[mean=0.21 s, 99th=0.35 s, worst=3.39 s]. Write latencies
are [mean=0.17 s, 99th=0.3 s, worst=0.3 s]. For the
intercontinental network, read latencies are [mean=0.7 s,
99th=8.2 s, worst=11 s]. Write latencies are [mean=0.2 s,
99th=0.4 s, worst=0.4 s].


First, when the user is on a city-wide 3G network, she 
is connected to the Internet through a ZTE MF112 mo- 
bile broadband device connected to her laptop. Figure 7 
shows the latency results. The first request incurs a first- 
time setup cost from the 3G provider, which is also the 
worst-case latency (we do not know what the provider is 
doing; subsequent runs do not incur this penalty, but we 
show the worst case). We measured a minimum of 0.23 s 
ping latency for 64 KB sizes, 0.13 s for 4 KB sizes in this 
environment, and ZZFS’s overhead is comparable. The 
latency is good enough for listening to music.


Second, when the user is on the west coast of the US 
(Redmond, Washington) she is connected to the Internet 
through a 56 Mbps wireless LAN. The location of the 
music files is on a desktop in Cambridge, UK. Figure 7 
shows the results. We measured a minimum of 0.25 s 
ping latency for 64 KB sizes, 0.19 s for 4 KB sizes in 
this environment. ZZFS’s write overhead is comparable, 
but its average read latency is 60% higher than ping. We 
believe this is due to the unpredictable nature of the in- 
tercontinental network. Nevertheless, the user does not 
perceive any noticeable delay once the music starts. 


A takeaway message from this section is that ZZFS’s
performance is good enough in all cases for the applications
involved. Data is never cached in these experiments,
so we expect even better performance in practice.


Figure 8: Latency tradeoffs for a client’s read requests.


5.4 Sensitivity to parameters 


This section reexamines the above scenarios and others 
analytically while changing several tunable parameters. 

In the next analysis, we revisit the music scenario.
We still have two devices $D_1$ and $D_2$. First, we vary
the amount of idle time $J$ before $D_1$ enters standby ($D_2$
never enters standby since it is the device with the music
player). Without loss of generality, we assume $D_1$’s
average access latency when $D_1$ is on, $L_1^{on}$, is slower
than $D_2$’s average access latency $L_2$ (e.g., $D_1$ could be on
the 3G network). $L_1^{stdby}$ is the average access latency
when $D_1$ is on standby. It is the time to resume the device
plus $L_1^{on}$.

Second, we vary the fraction of files $p_1$ that reside on
the slower device ($p_2 = 1 - p_1$). For example, if $D_1$
enters standby after $J = 15$ idle minutes and each
song is on average $M = 5$ minutes in length, $D_1$ will enter
standby if at least $\lceil J/M \rceil = 3$ consecutive songs are
played from $D_2$ (with no loss of generality, we assume
the writes go to a database also on $D_2$ this time; otherwise
$D_1$ will never enter standby). Figure 8 shows the
expected average latency given by:


$$E[L] = E[L \mid D_1 = \mathrm{ON}]\, p\{D_1 = \mathrm{ON}\} + E[L \mid D_1 = \mathrm{STDBY}]\, p\{D_1 = \mathrm{STDBY}\} \qquad (1)$$


The above equation further expands to
$E[L] = (p_1 L_1^{on} + p_2 L_2)\, p\{D_1 = \mathrm{ON}\} + (p_1 L_1^{stdby} + p_2 L_2)\, p\{D_1 = \mathrm{STDBY}\}$.
The analysis assumes a user
is forever listening to songs, and this graph shows the
long-running latency of accesses. All the lines assume
the switch-on times of the Dell T3500, except for the
low switch-on cost line, which uses those of the Mac.
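The expanded form of the expected-latency equation can be evaluated directly. The sketch below is illustrative only: the function name and arguments are introduced here, and the standby probability is taken as an input rather than derived, since the text does not give a closed form for it.

```python
def expected_latency(p1: float, l1_on: float, l2: float,
                     resume_time: float, p_standby: float) -> float:
    """Expected access latency E[L] for the two-device music scenario.

    p1          -- fraction of files on the slower device D1 (p2 = 1 - p1)
    l1_on       -- D1's average access latency when it is on
    l2          -- D2's average access latency (D2 never enters standby)
    resume_time -- time to resume D1 from standby
    p_standby   -- probability that D1 is in standby when a request arrives
    """
    p2 = 1.0 - p1
    l1_stdby = resume_time + l1_on  # standby latency = resume time + on latency
    on_term = (p1 * l1_on + p2 * l2) * (1.0 - p_standby)
    stdby_term = (p1 * l1_stdby + p2 * l2) * p_standby
    return on_term + stdby_term
```

For example, with hypothetical values `p1 = 0.5`, `l1_on = 0.2`, `l2 = 0.1`, `resume_time = 10` and `p_standby = 0.5`, the two terms are 0.075 and 2.575, so the resume cost dominates the average.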

We make several observations. In both extremes,
where all files accessed are on $D_2$ or all files are on $D_1$,
the latency is simply that of $D_2$ or $D_1$ respectively. If a
device goes into standby, the worst latency tends to happen
when the user accesses it infrequently, thus giving it
time to enter standby before it must be resumed.


Figure 9: Latency tradeoffs for a client’s requests when
data is replicated on both devices.


The next analysis examines the impact of the
read:write ratio of the workload. 2-way replication is
used, and the same arguments are made about standby.
The difference is that, in this case, $D_1$ enters standby if
there are consecutive reads on $D_2$ (a write would wake
up $D_1$ since it needs to be mirrored there). Without loss
of generality, we assume a read or write comes every 5
minutes and $D_1$ enters standby after $J = 15$ minutes. We
illustrate the impact of turning on the device vs. always
offloading (unrealistic in practice) vs. temporarily offloading
while the device switches on. We assume without
loss of generality that data is offloaded to a slow device,
e.g., a data center.

Figure 9 shows the expected average latency $E[L]$ (a
similar formula to the previous example is used, but the
standby latency is the offload latency). We make several
observations. For an all-read workload, all files are read
from $D_2$ (the faster device). For an all-write workload, the
latency is determined by the slowest device. This slower
device is either the offload device, if we always offload,
or $D_1$. In all cases, offloading masks any switch-on costs.


5.5 Metadata 


Table 2 shows the number of files for four families the 
authors of this paper are part of. This data is biased to- 
wards families with tech-savvy members. However, the 
point we make in this section is not that this data 1s repre- 
sentative of the population at large. We only confirm an 
observation made by Strauss et al. [26] that the amount 
of metadata involved is small in all cases and could eas- 
ily reside in a data center today, and/or be fully cached 
on most consumer devices. We do this while showing 
that ZZFS’s metadata structures are reasonably efficient. 

We measured the amount of data with R = 1 and extrapolated
for R = 3. The amount of metadata is calculated
from ZZFS’s metadata structures and is a function
of the replication factor and number of files. It is interesting
to observe that the second family has relatively
fewer media files, and hence the average file size is much 
smaller than the other families. This translates to a higher 
relative metadata cost. Intuitively, the ratio of metadata 
to data decreases with larger file sizes. 
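The relative metadata cost in Table 2 can be checked with a short calculation. The sketch below uses the R = 1 rows of the table and assumes decimal units (1 GB = 1000 MB); the function and variable names are introduced here for illustration.

```python
def metadata_percent(metadata_mb: float, data_gb: float) -> float:
    """Metadata size as a percentage of data size (assumes 1 GB = 1000 MB)."""
    return 100.0 * metadata_mb / (data_gb * 1000.0)


# R = 1 rows of Table 2: family -> (data in GB, metadata in MB)
TABLE_R1 = {1: (582, 11), 2: (2.44, 1.6), 3: (705, 15), 4: (164, 61)}

for family, (data_gb, metadata_mb) in TABLE_R1.items():
    print(family, f"{metadata_percent(metadata_mb, data_gb):.4f}%")
```

The second family, with the smallest average file size, shows the highest ratio, which matches the observation that relative metadata cost falls as files grow.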


6 Related work 


Data placement on devices and servers: AFS [8] and 
Coda [10] pioneered the use of a single namespace to 
manage a set of servers. AFS requires that clients be
connected with AFS servers, while Coda allows discon- 
nected operations. Clients cache files that have been 
hoarded. BlueFS [16] allows for disconnected opera- 
tion, handles a variety of modern devices and optimizes 
data placement with regard to energy as well. Ensem- 
Blue [19] improved on BlueFS by allowing for a peer- 
to-peer dissemination of updates, rather than relying on 
a central file server. In Perspective, Salmon et al. use the 
view abstraction to help users set policies, based on meta- 
data tags, about which files should be stored on which 
devices [23]. Recent work on Anzere [22] and Pod- 
Base [20] emphasizes the richness of the data placement 
policy space for home users. 

An implicit assumption of the above work is that home 
users know how to set up these policies. This assump- 
tion might have been borrowed from enterprise systems, 
where data placement decisions can be automated and 
are guided by clear utility functions [27]. Our low- 
power communication channel can help with the exe- 
cution of the above policies and can be used by most 
of the above systems as an orthogonal layer. It ensures 
that devices are awoken appropriately when the storage 
protocols need them to. While our design is compatible 
with the above work, ZZFS’s choice of specific policies 
for data placement is arguably simpler than in the above 
work. It stems from our belief that, for many users, it 
takes too much time and effort to be organized enough 
to specify placement and replication policies like in Per- 
spective or Anzere. ZZFS shows that in many common 
cases, no user involvement is required at all. 

Consistency: Cimbiosys [21] and Perspective [23] 
allow for eventual consistency. Cimbiosys permits 
content-based partial replication among devices and is 
designed to support collaboration (e.g., shared calen- 
dars). Bayou [28] allows for application-specific conflict 
resolution. Our work can help the user’s perception of 
consistency and reduces the number of accidental con- 
flicts. In a system with eventual consistency, the low- 
power communication channel can be seen as helping re- 
duce the “eventual” time to reach consistency, by turning 
dormant devices on appropriately. 

File system best practices: ZZFS builds on consider- 
able work on best-practices in file system design. For ex- 
Family  R  #files   data (GB)  metadata (MB) - % 
1       1  23291    582        11 (0.0019%) 
        3  23291    1746       68 (0.0038%) 
2       1  3177     2.44       1.6 (0.06%) 
        3  3177     7.32       9.3 (0.12%) 
3       1  31621    705        15 (0.002%) 
        3  31621    2116       93 (0.004%) 
4       1  124645   164        61 (0.036%) 
        3  124645   492        365 (0.07%) 


Table 2: In ZZFS, the size of metadata is O(numfiles 
x numdevices). This table shows the total data and 
metadata size for existing files of some of the authors. 
Files included are “Documents,” “Pictures,” “Videos” 
and “Music.” R is the replication factor. 


ample, our distributed storage system has a NASD-based 
architecture [7], where metadata accesses are decoupled 
from data accesses and file naming is decoupled from lo- 
cation. The system is device-transparent [26]. The I/O 
director maintains versioned histories of files that can 
later be merged and is based on I/O offloading [15, 29]. 

User-centered design: We were inspired by a user- 
centered approach to system design. This was manifest 
not only in undertaking a small version of user research 
ourselves (Section 4), but by reference to the findings in 
the HCI literature in general. This literature still remains 
small on the topic dealt with here (e.g., see [20, 23] 
and also [5, 13, 17, 18]), but nevertheless helped provide 
some of the insights key to the technical work which is 
the main contribution of the paper. 


7 Summary 


Unpredictable networks and user behavior and non- 
uniform energy-saving policies are a fact of life. They 
act as barriers to the execution of well-intended personal 
storage system policies. This paper’s contribution is to 
better manage these inherent uncertainties. Our design 
enables a world in which devices can be rapidly turned 
on and off and are always network aware, even when off 
or dormant. The implications for the file system were 
illustrated through the implementation of ZZFS, a dis- 
tributed device and cloud file system, designed for spon- 
taneous and rather ad hoc file accesses. 
Abstract 


Conventional wisdom holds that storage is not a big con- 
tributor to application performance on mobile devices. 
Flash storage (the type most commonly used today) draws 
little power, and its performance is thought to exceed that 
of the network subsystem. In this paper we present ev- 
idence that storage performance does indeed affect the 
performance of several common applications such as web 
browsing, Maps, application install, email, and Facebook. 
For several Android smartphones, we find that just by 
varying the underlying flash storage, performance over 
WiFi can typically vary between 100% and 300% across 
applications; in one extreme scenario the variation jumped 
to over 2000%. We identify the reasons for the strong cor- 
relation between storage and application performance to 
be a combination of poor flash device performance, ran- 
dom I/O from application databases, and heavy-handed 
use of synchronous writes; based on our findings we im- 
plement and evaluate a set of pilot solutions to address 
the storage performance deficiencies in smartphones. 


1 Introduction 


Mobile phones, tablets, and ultra-portable laptops are no 
longer viewed as the wimpy siblings of the personal com- 
puter; for many users they have become the dominant 
computing device for a wide variety of applications. Ac- 
cording to a recent Gartner report, within the next three 
years, mobile devices will surpass the PC as the most 
common web access device worldwide [38]. By 2013, 
over 40% of the enhanced phone installed-base will be 
equipped with advanced browsers [57]. 

Research pertaining to mobile devices can be broadly 
split into applications and services, device architecture, 
and operating systems. From a systems perspective, re- 
search has tackled many important aspects: understanding 
and improving energy management [36, 59, 26], network 
middleware [53], application execution models [30, 29], 
security and privacy [25, 32, 34, 39], and usability [27]. 
Prior research has also addressed several important issues 
centered around mobile functionality [55, 65], data man- 
agement [66], and disconnected access [49, 37]. However, 
one important component is conspicuously missing from 
the mobile research landscape — storage performance. 


*Work done as an intern, now at Georgia Institute of Technology 
Figure 1: Peak throughput of wireless networks. Trends 
for local and wide-area wireless networks over past three 
decades; y-axis is log base 2. 


Storage has traditionally not been viewed as a criti- 
cal component of phones, tablets, and PDAs — at least 
in terms of the expected performance. Despite the impe- 
tus to provide faster mobile access to content locally [40] 
and through cloud services [61], performance of the un- 
derlying storage subsystem on mobile devices is not well 
understood. Our work started with a simple motivating 
question: does storage affect the performance of popular 
mobile applications? Conventional wisdom suggests the 
answer to be no, as long as storage performance exceeds 
that of the network subsystem. We find evidence to the 
contrary — even interactive applications like web brows- 
ing slow down with slower storage. 

Storage performance on mobile devices is important 
for end-user experience today, and its impact is expected 
to grow due to several reasons. First, emerging wireless 
technologies such as 802.11n (600 Mbps peak throughput) 
[68] and 802.11ad (or “60 GHz”, 7 Gbps peak 
throughput) offer the potential for significantly higher net- 
work throughput to mobile devices [41]. Figure 1 presents 
the trends for network performance over the last sev- 
eral decades; local-area networks are not necessarily the 
de-facto bottleneck on modern mobile devices. Second, 
while network throughput is increasing phenomenally, la- 
tency is not [62]. As a result, access to several cloud 
services benefits from a split of functionality between the 
cloud and the device [29], placing a greater burden on local 
resources including storage [51]. Third, mobile devices 
are increasingly being used as the primary computing 
device, running more performance intensive tasks than 
previously imagined. Smartphone usage is on the rise; 
smartphones and tablet computers are becoming a popular 
replacement for laptops [23]. In developing economies, a 
mobile/enhanced phone is often the only computing de- 
vice available to a user for a variety of needs. 

In this paper, we present a detailed analysis of the I/O 
behavior of mobile applications on Android-based smart- 
phones and flash storage drives. We particularly focus on 
popular applications used by the majority of mobile users, 
such as web browsing, app install, Google Maps, Facebook, 
and email. Not only are these activities available 
on almost all smartphones, but they are done frequently 
enough that performance problems with them negatively 
impact user experience. Further, we provide pilot solutions 
to overcome existing limitations. 

To perform our analysis, we build a measurement in- 
frastructure for Android consisting of generic firmware 
changes and a custom Linux kernel modified to provide 
resource usage information. We also develop novel tech- 
niques to enable detailed, automated, and repeatable mea- 
surements on the internal and external smartphone flash 
storage, and with different network configurations that are 
otherwise not possible with the stock setup; for automated 
testing with GUI-based applications, we develop a bench- 
mark harness using MonkeyRunner [16]. 

In our initial efforts, we propose and develop a set of pi- 
lot solutions that improve the performance of the storage 
subsystem and consequently mobile applications. Within 
the context of our Android environment, we investigate 
the benefits of employing a small amount of phase-change 
memory to store performance critical data, a RAID driver 
encompassing the internal flash and external SD card, us- 
ing a log-structured file system for storing the SQLite 
databases, and changes to the SQLite fsync codepath. 
We find that changes to the storage subsystem can sig- 
nificantly improve user experience; our pilot solutions 
demonstrate possible benefits and serve as references for 
deployable solutions in the future. 

As the popularity of Android-based devices surges, the 
setup we have examined reflects an increasingly relevant 
software and hardware stack used by hundreds of millions 
of users worldwide; understanding and improving the 
experience of mobile users is thus a relevant research 
thrust for the storage community. Through our analysis 
and design we make several observations: 


Storage affects application performance: often in 
unanticipated ways, storage affects performance of 
applications that are traditionally thought of as CPU or 
network bound. For example, we found web browsing 
to be severely affected by the choice of the underlying 
storage; just by varying the underlying flash storage, 
performance of web browsing over WiFi varied by 187% 
and over a faster network (setup over USB) by 220%. In 
the case of a particularly poor flash device, the variation 
exceeded 2000% for WiFi and 2450% for USB. 

Speed class considered irrelevant: our benchmarking 
reveals that the “speed class” marking on SD cards is 
not necessarily indicative of application performance; 
although the class rating is meant for sequential perfor- 
mance, we find several cases in which higher-grade SD 
cards performed worse than lower-grade ones overall. 
Slower storage consumes more CPU: we observe 
higher total CPU consumption for the same application 
when using slower cards; the reason can be attributed to 
deficiencies in either the network subsystem, the storage 
subsystem, or both. Unless resolved, lower performing 
storage not only makes the application run slower, it also 
increases the energy consumption of the device. 
Application knowledge ensures efficient solutions: 
leveraging a small amount of domain or application 
knowledge provides efficiency, such as in the case of our 
pilot solutions; hardware and software solutions can both 
benefit from a better understanding of how applications 
are using the underlying storage. 


The contributions of this paper are threefold. First, we 
describe our measurement infrastructure that enables custom 
setup of the firmware and software stack on Android 
devices to perform in-depth I/O analysis; along with the 
systems software, we contribute a set of benchmarks that 
automate several popular GUI-based applications. Sec- 
ond, we present a detailed analysis of storage performance 
on real Android smartphones and flash devices; to the 
best of our knowledge, no such study currently exists in 
the research literature. We find a strong correlation be- 
tween storage and performance of common applications 
and contribute all our research findings. Third, we pro- 
pose and evaluate pilot solutions to address the perfor- 
mance issues on mobile devices. 


Based on our experimental findings and observations 
we believe improvements in the mobile storage stack can 
be made along multiple dimensions to keep up with the 
increasing demands placed on mobile devices. Storage 
device improvements alone can account for significant 
improvements to application performance. Device man- 
ufacturers are actively looking to bring faster devices to 
the mobile market; Samsung announced the launch of a 
PCM-based multi-chip package for mobile handsets [60]. 
Mobile I/O and memory bus technology needs to evolve 
as well to sustain higher throughput to the devices. Limi- 
tations in the systems software stack can however prevent 
applications from realizing the full potential of hardware 
improvements; we believe changes are also warranted in 
the mobile software stack to complement the hardware. 
Figure 2: Android Architecture. 

Figure 3: Overview of Android’s Storage Schema. 


Partition    Size and Type    Function 
(name lost)  896 KB           Persistent shared space for OS and bootloader to communicate 
recovery     4 MB, rootfs     Alternative boot-into-recovery partition for advanced recovery and maintenance ops 
boot         3.5 MB, rootfs   Enables the phone to boot, includes the bootloader and kernel/initrd 
system       145 MB, yaffs2   Contains remaining OS, pre-installed system apps, and user interface; typically read-only 
cache        95 MB, yaffs2    Android can use it to stage and apply “over the air” updates; holds system images 
data         196 MB, yaffs2   Stores user data (e.g., contacts, messages, settings) and installed applications; SQLite database containing app data also stored here. Factory reset wipes this partition 
sdcard       multi-GB, FAT32  External SD card partition to store media, documents, backup files etc 
sd-ext       Varies           Additional partition on SD card that can act as data partition, setup is possible through a custom ROM and data2SD software; non-standard Android partition 


Table 1: Data storage partitions for Android. Partitions on internal flash and external SD card for Nexus One phone. 


2 Mobile Device Overview 


2.1 Android Overview 


We present a brief overview of Android as it pertains to 
our storage analysis and development. Figure 2 shows a 
simplified Android stack consisting of flash storage, oper- 
ating system and Java middleware, and applications; the 
OS itself is based on Linux and contains low-level drivers 
(e.g., flash memory, network, and power management), 
Dalvik virtual machine for application isolation and mem- 
ory management, several libraries (e.g., SQLite, libc), and 
an application framework for development of new appli- 
cations using system services and hardware. 


The Dalvik VM is a fast register-based VM provid- 
ing a small memory footprint; each application runs as 
its own process, with its own instance of the Dalvik VM. 
Android also supports “true” multitasking and several ap- 
plications run as background processes; processes con- 
tinue running in the background when user leaves an ap- 
plication (e.g., a browser downloading web pages). An- 
droid’s web browser is based on the open-source WebKit 
engine [4]; details on Android architecture and develop- 
ment can be found on the developer website [2]. 


2.2 Android Storage Subsystem 


Most mobile devices are provisioned with an internal flash 
storage, an external SD card slot, and a limited amount of 
RAM. In addition, some devices (e.g., LG G2X phone) 
also have a non-removable SD partition inside the phone; 
such storage is still treated as external. 


Figure 3 shows the internal raw NAND and external 
flash storage on the Google Nexus One phone. The inter- 
nal flash storage contains all the important system parti- 
tions, including partitions for the bootloader and kernel, 
recovery, system settings, pre-installed system applica- 
tions, and user-installed application data. The external 
storage is primarily used for storing user content such as 
media files (i.e., songs, movies, and photographs), documents, 
and backup images. Table 1 presents the functionality 
of the partitions in detail; this storage setup is fairly 
typical across Android devices. 


Applications can store configuration and data on the de- 
vice’s internal storage as well as on the external SD card. 
Android uses the SQLite [22] database as the primary means 
for storage of structured data. SQLite is a transactional 
database engine that is lightweight, occupying a small 
amount of disk storage and memory; it is thus popular 
on embedded and mobile operating systems. Applications 
are provided a well defined interface to create, query, and 


FAST ’12: 10th USENIX Conference on File and Storage Technologies 




manage their databases; one or more SQLite databases are 
stored per application on /data. 

The YAFFS2 [52] file system managing raw NAND 
flash was traditionally the file system of choice for the var- 
ious internal partitions including /system and /data; 
it is lightweight and optimized for flash storage. Recently, 
Android transitioned to Ext4 as the default file system 
for these partitions [64]. Android provides a filesystem- 
like interface to access the external storage as well, with 
FAT32 as the commonly used file system on SD cards for 
compatibility reasons. 

We believe the storage architecture described in this 
section is similar for other mobile operating systems as 
well; for example, Apple’s iOS also uses SQLite to store 
application data. iOS Core Data is a data model framework 
built on top of SQLite; it provides applications access 
to common functionality such as save, restore, undo, 
and redo. iOS 4 does not have a central file storage architecture; 
rather, every file is stored within the context 
of an application. We focus on Android, since it allows 
systems-level development. 


3 Android Measurement Setup 


Since setting up smartphones for systems analysis and de- 
velopment is non-trivial, we describe our process here in 
detail; we believe this setup can be useful for someone 
conducting storage research on Android devices. 


3.1 Mobile Device Setup 


In this paper we present results for experiments on the 
Google Nexus One phone [12]. We also performed the 
same or a subset of experiments on the HTC Desire [13], 
LG G2X [15], and HTC EVO [14]; the results were simi- 
lar and are omitted to save space. 

The Nexus One is a GSM phone with a 1 GHz 
Qualcomm QSD8250 Snapdragon processor, 512 MB 
RAM, and 512 MB internal flash storage; the phone is 
running Android Gingerbread 2.3.4, the CyanogenMod 
7.1.0 firmware [10] or the Android Open Source Project 
(AOSP) [3] distribution (as needed), and a Linux kernel 
2.6.35.7 modified to provide resource usage information. 
We present a brief description of the generic OS cus- 
tomizations, which are fairly typical, and then explain the 
storage-specific customization later in this section. 

In order to prepare the phones for our experiments, we 
set up the Android Debug Bridge (ADB) [1] on a Linux 
machine running Ubuntu 10.10. ADB is a command-line 
tool provided as part of Android developer platform tools 
that lets a host computer communicate with an Android 
device; the target device needs to be connected to the host 
via USB (in the USB debugging mode) or via TCP/IP. We 
subsequently root the device with unrevoked3 [20] to flash 
a custom recovery image (ClockworkMod [7]). 

For our experiments we needed to bypass some of 
the constraints of the stock firmware; in particular, we 
needed support for reverse tethering the mobile device 
via USB, the ability to custom partition the storage, and 
access to a wider range of system tools and Linux util- 
ities for development. For example, BusyBox [6] is a 
software application that provides many of the standard 
Linux tools within a single executable, ideal for an em- 
bedded device. CyanogenMod [10] is a custom firmware 
that provides these capabilities and is supported on a vari- 
ety of smartphones. The Android Open Source Project 
(AOSP) [3] distribution provides capabilities similar to 
CyanogenMod but is supported only on a handful of 
Google smartphones, including the Google Nexus One. 

We used the CyanogenMod distribution for all exper- 
iments on non-Nexus phones, and for experiments that 
require comparison between a non-Nexus and the Nexus 
One phone (not shown in this paper). All Google Nexus 
One results presented in this paper exclusively use AOSP; 
we equipped both CyanogenMod and AOSP distributions 
with our measurement-centric customizations. 

An important requirement, specific to our storage ex- 
periments, is to be able to compare and contrast applica- 
tion performance on different storage devices. Some of 
these applications heavily use the internal non-removable 
storage. In order to observe and measure all I/O activity, 
we change Android’s init process to mount the different 
internal partitions on the external storage. Our approach 
is similar to the one taken by Data2SD [19]; in addition, 
we were able to also migrate to the SD card the /system 
and /cache partitions. 

In order to adhere to Android’s boot-time compatibil- 
ity tests, we provided a 256 MB FAT32 partition at the 
beginning of the SD card, mounted as /sdcard. The 
/system, /cache, and /data partitions were format- 
ted as Ext3; at the time we conducted our experiments, 
YAFFS2 and Ext3 were the pre-installed file systems on 
our test phones. We performed a preliminary compari- 
son between Ext3 and Ext4 since Android announced the 
switch to Ext4 [64], but found the performance differ- 
ences to be minor; a detailed comparison across several 
file systems can provide more useful data in the future. 

Note that this setup is not normally used by end-users 
but allows us to run what-if scenarios with storage devices 
of different performance characteristics; the internal flash 
represents only a single data point in this set. 

As part of our experiments, we want to understand the 
impact of storage on application performance under cur- 
rent WiFi networks, as well as under faster network con- 
nectivity (likely to be available in the future). For WiFi, 
we set up a dedicated wireless access point (IEEE 802.11 
b/g) on a Dell laptop having 2GB RAM and an Intel Core2 
processor. Since we do not have a faster wireless network 
on the phone, we emulate one by reverse tethering [21] it 
over the miniUSB cable connection with the same laptop 
Table 2: Network Performance. Transfer rates for WiFi and USB reverse tether link with iperf (MB/s). 


[Table 3 body not recovered. Surviving row labels: Sandisk, Kingston, Wintec, A-Data, Patriot, PNY. Columns: SqW, SqR, RnW, RnR for performance on desktop (MB/s) and on phone (MB/s).] 


Table 3: Raw device performance and cost. Measurements on desktop with card reader (left) and on actual phone (right). “Sq” is sequential and “Rn” is random performance. 


(allowing the device to access the internet connection of 
the host); Table 2 shows the measured performance of our 
WiFi and USB RT link using iperf [46]. 

To minimize variability due to network connections 
and dynamic content, we set up a local web server running 
Apache on the laptop. The web server downloads the 
web pages that are to be visited during an experiment and 
caches them in memory; where available, we download 
the mobile friendly version of a web site. 

We conducted all experiments on the internal non- 
removable flash storage and eight removable microSDHC 
cards, two each from the different SD speed classes [17]. 
Table 3 lists the SD cards along with their specifica- 
tions and a baseline performance measurement done on 
a Transcend TS-RDP8K card reader¹ using the CrystalDiskMark 
benchmark V3.0.1 [9] (shown on the left side). 
The total amount of data written is 100 MB, random I/O 
size is 4 KB, and we report average performance over 3 
runs; observed standard deviation is low and we omit it 
from the table. Prices shown are as ordered from Ama- 
zon.com and its resellers, and Buy.com (to be treated as 
approximate). We also performed similar benchmarking 
experiments for the eight cards on the Nexus One phone 
itself, using our own benchmark program. Testing con- 
figuration is as before with 4KB random I/O size and 128 
MB of sequential I/O; results in Table 3 (shown on the 
right side) exhibit a similar trend albeit lower performance 
than for desktop. 
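
The on-phone benchmark program itself is not listed in this paper. As a rough illustration of the kind of measurement described above (fixed-size I/Os issued sequentially or at random offsets, reported in MB/s), a minimal sketch follows. The function name, the per-write fsync, and the preallocation step are assumptions made for this sketch and are not taken from the original tool. 

```python
import os
import random
import time


def write_throughput(path, total_bytes, io_size, sequential=True, seed=0):
    """Write total_bytes to path in io_size chunks and return MB/s.

    Assumption: every write is followed by fsync so that the device,
    rather than the page cache, dominates the measurement.
    """
    n_ios = total_bytes // io_size
    offsets = [i * io_size for i in range(n_ios)]
    if not sequential:
        # Random workload: same offsets, visited in shuffled order.
        random.Random(seed).shuffle(offsets)
    buf = os.urandom(io_size)

    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.ftruncate(fd, total_bytes)  # size the file before timing
        start = time.perf_counter()
        for off in offsets:
            os.pwrite(fd, buf, off)
            os.fsync(fd)
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
    return (n_ios * io_size) / (1024 * 1024) / elapsed
```

Under these assumptions, a 4 KB random-write run over a 128 MB file would be `write_throughput(target, 128 * 1024 * 1024, 4096, sequential=False)`, where `target` is a file on the card under test. 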

To summarize, read performance of the different cards 
is not a crucial differentiating factor and much better over- 
all than the write performance. Sequential reads clearly 
show little or no correlation with the speed class; sequen- 
tial write performance roughly improves with speed class, 
but with enough exceptions to not qualify as monotonic. 
Random read performance is not significantly different 
across the cards. The most surprising finding is for ran- 
dom writes: most if not all exhibit abysmal performance 
(0.02 MB/s or less!); even when sequential write perfor- 
mance quadruples (e.g., Transcend versus Wintec), ran- 
dom writes perform several orders of magnitude worse. 


¹ Note that internal flash could not be measured this way. 


In terms of overall write performance including ran- 
dom and sequential, Kingston consistently performs the 
worst and tends to considerably skew the results; we try 
not to rely on Kingston results alone when making a claim 
about storage performance. In practice, we find that ap- 
plication performance varies even with the other better 
cards. Transcend performs the best for random writes, by 
as much as a factor of 100 compared to many cards, but 
performs the worst for sequential writes; Sandisk shows a 
similar trend. A-Data, Patriot, Wintec, and PNY perform 
poorly for random, but give very good sequential perfor- 
mance. Kingston and RiData suffer on both counts as they 
not only have poor random write performance, but also 
mediocre sequential write performance (shown in bold in 
Table 3); application-level measurements in §4 reflect the 
consequences of the poor microbenchmark results. 


3.2 Measurement Software 

We first explain our measurement environment and the 
changes introduced to collect performance statistics: (1) 
We made small changes to the microSD card driver to 
allow us to check “busyness” of the storage device by 
polling the status of the /proc/storage_usage file. 
(2) We wrote a background monitoring tool (Monitor) 
to periodically read the proc file system and store sum- 
mary information to a log file; the log file is written to 
the internal /cache partition to avoid influencing the SD 
card performance. CPU, memory, storage, and network 
utilization information is obtained from /proc/stat, 
/proc/meminfo, /proc/storage_usage (busy- 
ness) and /proc/diskstats, and /proc/net/dev 
respectively. (3) We use blktrace [5] to collect block-level 
traces for device I/O. 

In order to ascertain the overheads of our instrumen- 
tation, we conducted experiments with and without the 
measurement environment; we found that our changes in- 
troduce an overhead of less than 2% in total runtime. 
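
To make the CPU part of the Monitor tool concrete, the sketch below shows one way utilization could be derived from two snapshots of /proc/stat. It is an illustration only: the function names are hypothetical, and treating idle plus iowait as "not busy" is an assumption of this sketch, not a documented property of the Monitor tool. 

```python
def parse_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            # Guest fields are already counted in user/nice, so stop at steal.
            total = sum(values[:8])
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_utilization(prev_text, cur_text):
    """Fraction of CPU time spent busy between two /proc/stat snapshots."""
    prev_busy, prev_total = parse_cpu_times(prev_text)
    cur_busy, cur_total = parse_cpu_times(cur_text)
    delta_total = cur_total - prev_total
    if delta_total <= 0:
        return 0.0
    return (cur_busy - prev_busy) / delta_total
```

A periodic monitor would read /proc/stat at a fixed interval, pass consecutive snapshots to `cpu_utilization`, and append the result to its log. 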

Since many popular mobile applications are interactive, 
we needed a technique to execute these applications in a 
representative and reproducible manner; for this purpose 
we used the MonkeyRunner [16] tool to automate the ex- 
ecution of interactive applications. Our MonkeyRunner 




Install app            Size (MB)   Launch app   Size (MB) 
YouTube                            AngryBird 
Google Maps                        SnowBoard    23.54 
Facebook                           Weather 
Pandora                            Imdb 
Google Sky Map                     Books 
Angry Birds                        Gallery 
Music Download         0.70        Gmail 
Angry Birds Rio                    GasBuddy 
Words With Friends                 Twitter 
Advanced Task Killer   0.10        YouTube 

[Remaining size entries were lost in extraction.] 


Table 4: Apps for Install and Launch from Android 
Market. Install: top Apps in Aug 2011, total size 55.58 MB, 
avg size 5.56 MB; Launch: 10 apps launched individually. 





setup consists of a number of small programs put together 
to facilitate benchmarking with the necessary application; 
we illustrate the methodology next. 

First, we start the Monitor tool to collect resource uti- 
lization information and note its PID. Second, we start 
the application under test using MonkeyRunner which de- 
fines “button actions” to emulate pressing of various keys 
on the device’s touchscreen, for example, browsing for- 
ward and backward, zooming in and out with the touch- 
screen pinch, and clicking on screen to change display 
options. Third, while the various button actions are be- 
ing performed, CPU usage is tracked in order to automat- 
ically determine the end of an interactive action. A class 
function UntilIdle() that we wrote is called from the 
MonkeyRunner script to detect the execution status of an 
app; it determines idle status using a specified low CPU 
threshold and the minimum time the app needs to stay 
below the threshold to qualify as idle. Fourth, once the 
sequence of actions is completed, we perform necessary 
cleanup actions and return to the default home screen. 
Fifth, the Monitor tool is stopped and the resource usage 
data is dumped to the host computer. Similar scripts are 
used to reset the phone to a known state in order to repeat 
the experiment (to compute mean and deviation). 
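
The idle test in the third step can be sketched as follows. This is not the authors' UntilIdle() code, which is not listed here; it is a hypothetical re-statement of the rule described above (CPU below a threshold for a minimum duration), applied to an already recorded sequence of samples. 

```python
def until_idle(samples, threshold, min_idle_time):
    """Return the timestamp at which the app first qualifies as idle, or None.

    samples: iterable of (timestamp_seconds, cpu_fraction) in time order.
    The app counts as idle once CPU usage has stayed below `threshold`
    for at least `min_idle_time` seconds.
    """
    idle_since = None
    for t, cpu in samples:
        if cpu < threshold:
            if idle_since is None:
                idle_since = t  # start of a candidate idle period
            if t - idle_since >= min_idle_time:
                return t
        else:
            idle_since = None  # activity resets the idle period
    return None
```

A brief dip in CPU usage between two bursts of activity does not end the action, because any sample at or above the threshold resets the idle period. 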


3.3 Application Benchmarks 


We now describe the Android apps that we use to assess 
the impact of storage on application performance; we au- 
tomate a variety of popular and frequently used mobile 
apps to serve as benchmarks. 

WebBench: is a custom benchmark program we 
wrote to measure web browsing performance in a non- 
interactive manner; it is based on the standard WebView 
Java Class provided by Android. WebBench visits a pre- 
configured set of web sites one after the other and re- 
ports the total elapsed time for loading the web pages. 
In order to accurately measure the completion time, we 
made use of the public method of WebView class named 
onProgressChanged(); when a web page is fully 
loaded, WebBench starts loading the next web page on 
the list. We ran WebBench to visit the top 50 web sites 
according to a recent ranking [8]. 

AppInstall: installs a set of top 10 Android apps on 
Google Android Market (listed in Table 4 on the left), 
successively, using the adb install command. App 
installation is an important and frequently performed ac- 
tivity on smartphones; each application on the phone once 
installed is typically updated several times during subsequent 
usage. In addition, a user often needs to perform 
the install “on the go” based on location or situational 
requirements; for example, installing the IKEA app 
while shopping for furniture, or the GasBuddy app, when 
looking to refuel. 
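
As an illustration of how such a run could be timed from the host, the sketch below issues `adb install` for each package in turn and reports the total elapsed time. The function name and the injectable `run` parameter are assumptions of this sketch (the latter only makes the logic testable without a device); the actual AppInstall harness is not listed in this paper. 

```python
import subprocess
import time


def time_installs(apk_paths, run=subprocess.run):
    """Install each APK successively with `adb install`; return total seconds."""
    start = time.perf_counter()
    for apk in apk_paths:
        # check=True raises if adb reports a non-zero exit status.
        run(["adb", "install", apk], check=True)
    return time.perf_counter() - start
```
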

AppLaunch: launches a set of 10 Android apps using 
MonkeyRunner listed in Table 4 on the right; the apps are 
chosen to cover a variety of usage scenarios: games (An- 
gryBird and SnowBoard) take relatively longer to load, 
read traffic to storage dominates. Weather and GasBuddy 
apps download and show real-time information from re- 
mote servers, i.e., network traffic is high. Gmail and 
Twitter apps download and store data to local database, 
i.e., both network and storage traffic is high. Books and 
gallery apps scan the local storage and display the list of 
contents, i.e., read to storage dominates. Imdb has no 
storage or network traffic due to web cache hits, while 
YouTube launch is network intensive. 

Facebook: uses the Facebook for Android application; 
each run constitutes the following steps: (a) sign into the 
author’s Facebook account (b) load the news feed dis- 
played initially on the phone screen (c) “drag” the screen 
five times to load more feed data (d) sign out. 

Google Maps: uses the Google Maps for Android ap- 
plication; each run constitutes the following steps: (a) 
open the Maps application (b) enter origin and destina- 
tion addresses, and get directions (c) zoom into the map 
nine times successively (d) switch from “map” mode to 
“satellite” mode (e) close application.

Email: uses the native email app in Android; each run 
constitutes the following steps: (a) open the app, (b) input 
account information, (c) wait until a list of received emails 
appears, and (d) close the application. 

RLBench [56]: a synthetic benchmark app that gener- 
ates a pre-defined number of various SQL queries to test 
SQLite performance on Android. 

Pulse News [24]: a popular reader app that fetches 
news articles from a number of websites and stores them 
locally. Our benchmark consists of the following steps: 
(a) open Pulse app, (b) wait until news fetching process 
completes, and (c) close the app. 

Background: another popular usage scenario is concurrent
execution of two or more applications (Android
and iOS are both multi-threaded); several apps run in the
background to periodically “sync” data with a remote service
or to provide proactive notifications. Our benchmark
consists of the following set of apps in auto sync mode:
Twitter, books, contacts, Gmail, Picasa, and calendar, and
a set of active widgets: Pulse, news, weather, YouTube,
calendar, Facebook, Market, and Twitter.

[…] Transcend, R: RiData, S: Sandisk, K: Kingston, W: Wintec, A: AData, P: Patriot, Y: PNY. Some graphs are plotted with a discontinuous
y-axis to preserve clarity of the figure in the presence of outliers like Kingston.

For many of the above benchmarks (e.g., Facebook, 
Email, Pulse, Background), the actual contents and 
amount of data can vary across runs; we measure the total 
amount of data transferred and normalize the results per 
Megabyte. We also repeat the experiment several times to 
measure variations; for multiple iterations, the local appli- 
cation cache is deleted following each run. 
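
The per-megabyte normalization described above can be sketched as follows. This is a minimal illustration, not the authors' measurement code; the function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev

MB = 1024 * 1024

def seconds_per_mb(runs):
    """Normalize each run's elapsed time by the data it transferred.

    runs: list of (elapsed_seconds, bytes_transferred) tuples.
    """
    return [secs / (nbytes / MB) for secs, nbytes in runs]

# Hypothetical measurements for three runs of one benchmark.
sample_runs = [(10.0, 2 * MB), (12.0, 3 * MB), (9.0, 2 * MB)]
normalized = seconds_per_mb(sample_runs)
print(f"mean: {mean(normalized):.2f} s/MB, spread: {pstdev(normalized):.2f} s/MB")
```

Normalizing before averaging keeps a run that happened to fetch more data from being counted as a slower run.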


4 Performance Evaluation 


In this section we present detailed measurement results 
for application runtime performance, application launch 
times, concurrent app execution, and CPU consumption. 


4.1 Application Runtime Performance 


The first set of experiments compares the performance of
WebBench on internal flash and the eight SD cards described
earlier. Figure 4 shows the runtime of WebBench
for WiFi and USB reverse tethering.

Surprisingly, even with WiFi, we notice a 187% performance
difference between the internal flash and RiData;
for Kingston, the difference was a whopping 2040%. To
ensure that the Kingston results were not due to a defec- 
tive device, we repeated the experiments with two more 
new Kingston cards from two different speed classes; we 
found the results to be similarly poor. From here on, so as
not to rely on Kingston alone when making a claim about
application performance, we report the difference with both
the second-worst and the worst performing card for any
given experiment.

Figure 6: SQLite I/O pattern. The left graph shows write I/O to the webcache directory contents on /data; the right graph shows writes
to SQLite database files. Reads are comparatively few and are omitted from presentation.

Figure 7: Application Launch. Launch times (secs) for several popular
apps on 8 SD cards, internal flash, and a memory-backed RAMdisk.

As expected, the faster the network (USB RT), the
higher the impact of storage: 222% difference between
internal and RiData, 2450% for Kingston. We find a sim- 
ilar trend for several popular apps; Figure 5 shows the 
results over WiFi for AppInstall, email, Google Maps, 
Facebook, RLBench, and Pulse. Since the phenomenon 
of storage and application performance correlation is 
clearly identifiable with existing WiFi networks, we hereafter
omit results for the USB network. The difference
between the best- and worst-case performance is
195% (225%) for AppInstall, 80% (1670%) for
email, 60% (660%) for Maps, 80% (575%) for Facebook,
130% (2210%) for RLBench, and 97% (168%) for Pulse;
Kingston numbers are shown in parentheses.


To better understand why storage affects application
performance, Table 5 presents a breakdown
of the I/O activity during various workload runs. The amount
of data read is less than the amount written for all workloads. In the case
of WebBench roughly 1.3 times more data is written se- 
quentially than randomly. Since the difference between 
sequential and random performance is at least a factor 
of 3 for all SD cards (see Table 3), the time to complete 
the random writes dominates; the same holds true for the 
other applications in the table. Although not shown in the 
table, the /data partition receives most of the I/O, with 
only a few reads going to the /system partition. 
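
The argument that random writes dominate can be checked with a short calculation. The sketch below uses the two ratios quoted in the text (1.3 times more sequential data, and a sequential-to-random speed gap of at least a factor of 3); the absolute data volume and throughput are hypothetical placeholders, not values from Table 3 or Table 5.

```python
def write_time_split(random_mb, seq_to_random_ratio, seq_mbps, seq_over_random_speed):
    """Return (sequential_seconds, random_seconds) for a mixed write workload."""
    seq_mb = random_mb * seq_to_random_ratio           # sequential data volume
    random_mbps = seq_mbps / seq_over_random_speed     # slower random throughput
    return seq_mb / seq_mbps, random_mb / random_mbps

# Hypothetical card: 10 MB/s sequential writes, random writes 3x slower.
seq_s, rand_s = write_time_split(
    random_mb=10.0, seq_to_random_ratio=1.3,
    seq_mbps=10.0, seq_over_random_speed=3.0,
)
print(f"sequential: {seq_s:.2f} s, random: {rand_s:.2f} s")
```

The ratio of random to sequential time is 3 / 1.3, roughly 2.3, regardless of the placeholder values, so the random writes account for most of the write time even though they carry less data.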

The disparity between sequential and random write
performance is inherent to flash-based storage; our
evaluation results suggest this to be one of the primary 
reasons behind the slower performance. However, this
still does not explain the presence of random writes
and overwrites even for seemingly sequential application
needs. To understand this, we take a closer look at
the applications and their usage of Android storage.

Table 6: App Launch Summary. Total data
(MB) read and written to storage and transferred
over the network for the set of apps launched.

The storage schema used by the browser application
consists of an unstructured web cache, which stores
image and media files, and two SQLite database files;
webview.db is a database for application settings and pref- 
erences and webviewCache.db stores an index to manage 
the web cache. The database files are much smaller in 
size compared to the cache; in our setup, the cache con- 
sisted of 315 files totaling 6MB whereas the database files 
were 34KB and 137KB for webview.db and webview- 
Cache.db respectively. Figure 6 shows the write pattern 
to the web cache directory and the SQLite database files; 
web cache writes are mostly sequential with reuse of the 
same address space over time; SQLite exhibits a high 
degree of random writes and updates to the same block 
addresses. Since the database writes are synchronous by
default, each write causes an (often unnecessary) delay.


4.2 Application Launch 


Application launch is an important performance met- 
ric [47], especially for mobile users. Figure 7 shows the 
time taken to launch a number of Android applications on 
the various flash storage devices; Table 6 lists those apps 
along with a summary of disk I/O reads and writes, and 
data transferred over the network during the launch. Most
apps take a few seconds to launch, with games taking upwards
of 10 seconds. Larger apps (e.g., games) tend to
take a noticeable amount of time to launch, contrary to
the target of “significantly less than 1 second to launch a
new app” [31]. As seen in Figure 7, barring a few exceptions,
the variation in launch time across storage devices ranges
from about 10% (for the Snowboard game) to 40% (for the
Weather app); Twitter (120%) and Gmail (250%) showed the
most variation.

Figure 9: Storage and CPU activity for WebBench on fast and slow SD cards. The graph on the left shows instantaneous
CPU utilization, memory consumption, and storage busyness during the course of a WebBench run on the fast Transcend card; the […]

Figure 8: Aggregate CPU for WebBench. Stacked bar
shows active, idle, and ioWait times on Nexus One; ioWait correlates
with runtimes (Fig 4). Even active times vary across devices,
showing that some devices burn more CPU for the same work!

In order to ascertain the upper bound of launch time 
improvement through storage, we placed all application 
data on a RAMdisk; the test is conducted with the PNY
card storing the /system, /sdcard, and /cache partitions,
and with the /data partition mounted with tmpfs.
To remove the effects of reading from /system and 
/sdcard, we warm the buffer cache; we verify the same 
by tracking all I/O to the flash storage. Launch times 
do not significantly change even when all data is being 
read from memory. Storage is likely not a significant con- 
tributor to app launch performance; research to speed up 
launch will perhaps benefit by focusing on other sources 
of delay such as application think time. 


4.3 Concurrent Applications

Figure 10 shows I/O activity for a 7200 second run of the 
Background workload; during the period, the phone re- 
ceived about 1.6 MB of data over the network. Interest- 
ingly, the amount of data written to storage in the same 
period is 30 MB (a factor of roughly 20); the majority 
of writes are for updating application-specific data and
indices in the SQLite databases. Although the storage
throughput requirement is quite low, the additional random
writes can cause latency spikes for foreground applications
(not shown). With the Android development
team’s desire to minimize application switch time and
provide the appearance of “all applications running all of
the time” [31] (see section: “When does an application
‘stop’?”) for mobile devices, handling concurrent applications
and their I/O demands can be an increasingly important
challenge in the future.

Figure 10: Background I/O pattern. Breakdown of I/O
issued by Background apps in 2 hours.


4.4 CPU Consumption

Figure 8 shows the breakdown of CPU utilization for 
WebBench; the stacked bar chart shows the CPU tick 
counts during active, idle, and ioWait periods (a “tick” 
corresponds to 10ms on our phone); Figure 9 shows the 
CPU utilization and I/O busyness for the same experi- 
ment for two SD cards: a fast Transcend, and a slow 
Kingston. Since the non-idle, non-ioWait CPU consump- 
tion includes not only the contribution of the benchmark 
but also all background activities, we also measured CPU 
consumption for background activities alone (to subtract 
from the total). Note that this is unlike the set of background
activities discussed in Section 3.3, as we turned
off automatic syncing and active widgets; we find that the 
share of CPU consumption due to background tasks is less 
than 1% of the total. 

The graphs reveal the interesting phenomenon that the aggregate
CPU consumed for the same benchmark increases
with a slower storage device (as seen by looking at just the
“active” component). This suggests that storage
has an indirect impact on energy consumption by burning
more CPU. Ideally, one would expect a fixed amount
of CPU to be consumed for the same amount of work;
since the results show CPU consumption to be disproportionate
to the amount of work, we hypothesize that this is due
to deficiencies in either the network subsystem, the storage
subsystem, or both. We need to investigate this matter
further to identify the root causes.

Figure 11: What-If Performance Analysis. Experiments
were conducted for WebBench (left) and Facebook (right); data
stored in memory using a RAMdisk and RiData card as the flash
backing store where needed (e.g., for baseline). Y-axis is Time
in seconds; Solutions A: Baseline, B: Cache in RAM, C: DB in
RAM, D: All in RAM, E: Disable Sync.

Slower storage also increases energy consumption in 
other indirect ways; for example, keeping the LCD screen 
turned on longer while performing interactive tasks, keep- 
ing the WiFi radio busy longer, and preventing the phone 
from going to a low-power mode sooner. 


5 Pilot Solutions 


We present potential improvements in application perfor- 
mance through storage system modifications. We start 
with a what-if analysis to provide the envelope of perfor- 
mance gains and then present a set of pilot solutions. 


5.1 What-If Analysis 


The detailed analysis of storage performance gave in- 
sights into the performance problems faced by applica- 
tions, but before proposing actual solutions we wanted 
to understand the scope for potential improvements. We 
performed a set of what-if analyses to obtain the upper 
bounds on performance gains that could be achieved, for 
example, by storing all data in memory. For the sake of comparison,
we performed experiments with both memory as the
backing store (using RAMdisk) and SD cards as the back- 
ing store; in the different analysis experiments we placed 
different kinds of data on the RAMdisk, for example, the 
cache, or the database files. Figure 11 compares the
relative benefits of the various approaches, as measured for
the WebBench and Facebook workloads for the RiData 
card and a RAMdisk; the trends for the other SD cards 
were similar, although the actual gains were of course dif- 
ferent with every card. 

Placing the entire “cache” folder on RAM (bars B) 
does improve performance, but not by much (i.e., 5% for 
WebBench and 15% for Facebook). Placing the SQLite 
database on RAM (bars C) however improves perfor- 
mance by factors of three and two for WebBench and 
Facebook respectively; placing both the cache and the 
database on RAM (bars D) does not provide significant 
additional benefit. Transforming the cache and database 
writes to be asynchronous (bars E) recoups most of the 
performance and performs comparably to the SQLite on 
RAM solution. 

The performance evaluation in the previous section and 
the what-if analysis lead to the following conclusions: 
First, the key bottleneck is the “wimpy” storage preva- 
lent today on mobile devices; even while the internal flash 
and the SD cards are increasingly being used for desktop-like
workloads, their performance is significantly worse
than storage media on laptops and desktops. Second, 
the Android OS exacerbates the poor storage performance 
through its choice of interfaces; the synchronous SQLite 
interface primarily geared for ease of application develop- 
ment is being used by applications that are perhaps better 
off with more light-weight consistency solutions. Third, 
the SQLite write traffic itself is quite random with plenty 
of synchronous overwrites to the flash storage causing fur- 
ther slowdown. Finally, apps use the Android interfaces 
oblivious to performance. A particularly striking example 
is the heavy-handed management of application caches 
through SQLite; the web browser writes a cache map to 
SQLite significantly slowing down the cache writes. 

We implement and evaluate a set of pilot solutions to 
show the potential for improving user experience through 
improvements in the Android storage system; while not 
rigorous enough to serve as deployable solutions, these 
can evolve into robust and detailed solutions in the future. 
We classify the solution space into four categories: 


• Better storage media for mobile devices to provide
baseline improvements
• Firmware and device drivers to effectively utilize
existing and upcoming storage devices
• Enhancements to mobile OS to avoid the storage
bottlenecks and provide new functionality
• Application-level changes to judiciously use the
supplied storage interfaces

Figure 12 shows the improvements through the pilot solutions
for WebBench and Facebook using Kingston and
RiData; as with the what-if analysis, trends for other SD
cards were similar but actual gains varied. Bars A in Figure 12
represent the baseline performance, while bars F
are meant to represent an upper bound on performance
with all data stored in RAM.

Figure 12: Pilot Solutions. Runtime results for WebBench (leftmost two) and Facebook (rightmost two) for the Kingston and
RiData cards; y-axis is Time in seconds. Solutions A: Baseline, B: RAID over SD, C: SQLite on Nilfs2, D: Selective Sync, E:
SQLite on PCM, F: All in RAM.


5.2 Storage Devices Not Wimpy Anymore 


An obvious solution is to improve the performance of the 
storage device, i.e., using better flash storage or a faster 
non-volatile memory such as PCM. Indeed, flash fabrica- 
tion technology itself is improving at a fair pace; scaling 
trends project flash to double in capacity every two years 
until the year 2016 [45]. However, when it comes to per- 
formance, cost pressures in the consumer market are driv- 
ing manufacturers to move away from the more reliable, 
higher performing SLC flash to the less reliable, lower 
performing MLC or TLC flash; this makes it harder to rely 
solely on improvements due to flash scaling. Our findings 
reveal that performance of a relatively small fraction of 
I/O traffic is responsible for a large fraction of overall ap- 
plication performance. A more efficient solution is thus 
to use the faster storage media as a persistent write buffer 
for the performance-sensitive I/O traffic: a small amount 
of PCM to buffer writes issued by the SQLite database 
can improve the performance. 

We built a simple PCM emulator for Android to evalu- 
ate our solution; the emulator is implemented as a pseudo 
block-device based on the timing specifications from re- 
cent work [28], using memory as the backing store. The 
PCM buffer can be used as staging area for all writes or as 
the final location for the SQLite databases; our emulator 
can be configured with a small number of device-specific 
parameters. Figure 12 (bars E) shows the performance improvements
from using a small amount (16 MB) of PCM; in
this experiment, PCM is used as the final location for only 
the database files. 

An alternative approach, as envisioned by Pocket 
Cloudlets [51], is to rely on substantial augmentation 
of existing flash storage capabilities on mobile devices 
and/or full replacement of flash with PCM or STT-MRAM [43].
In reality, storage-class memory may be
placed in different forms on the mobile system, for example,
on the CPU-memory bus, or as backing store for
the virtual memory. Our intent here was two-fold: (a) to understand
the approximate benefits of using such a persistent
buffer, and (b) to demonstrate that even with a relatively
small amount of PCM, significant gains can be made by
judiciously storing performance-critical data; a deployed
solution can certainly incorporate PCM in the storage hierarchy
in better ways.

Figure 13: Explanation of RAID Speedup. Variation in
throughput for SD cards with increasing write address range.


5.3 RAID over SD


Another solution is to leverage the I/O parallelism already 
existent on most phones: an internal flash drive and an ex- 
ternal SD card. We built a simple software RAID driver 
for Android with I/O striped to the two devices (RAID-0)
in 4 KB blocks. Note that a deployable solution will
require more effort: it would need to (a) handle storage
devices of potentially differing speeds, and (b) handle accidental
removal of the external SD card.
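
The striping described above can be illustrated with a small address-mapping sketch. It assumes plain round-robin RAID-0 over two devices with a 4 KB stripe unit, as stated in the text; the function and constant names are hypothetical, and this is not the authors' driver code.

```python
STRIPE_BYTES = 4 * 1024  # 4 KB stripe unit, as in the text
NUM_DEVICES = 2          # device 0: internal flash, device 1: external SD card

def map_offset(logical_offset):
    """Map a logical byte offset to (device_index, byte_offset_on_device)."""
    stripe = logical_offset // STRIPE_BYTES    # which 4 KB stripe unit
    within = logical_offset % STRIPE_BYTES     # position inside that unit
    device = stripe % NUM_DEVICES              # alternate devices per unit
    device_stripe = stripe // NUM_DEVICES      # unit index on that device
    return device, device_stripe * STRIPE_BYTES + within

for offset in (0, 4096, 8192, 12288):
    print(offset, "->", map_offset(offset))
```

Each device receives every other 4 KB unit, so a logical range of N bytes touches only about N/2 bytes of address space on each device.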

While for some SD cards we obtained the expected improvements
as in Figure 12 (bars B), i.e., greater than 1X
and less than 2X, for others we obtained a speedup greater
than 2X (not shown); we suspected that this could be due 
to the idiosyncrasies of the FTL on the card. As many con- 
sumer flash devices employ the log-block wear-leveling 
scheme [48], their performance is sensitive to the write 
footprint; a reduction in the amount of random writes re- 
duces the overhead of the garbage collection, improving 
the performance. 

To verify our hypothesis, we performed another exper- 
iment. Figure 13 shows the throughput obtained for an 
increasing address range with random writes; the I/O size 
is 4KB and number of requests is 2048, totaling 8 MB 
of writes. In order to minimize the effects of FTL state 
being carried forward from the previous experiment, we 
sequentially write 1 GB of data before every run. 
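
The request stream for this experiment can be sketched as below. The I/O size and request count come from the text; the random number generator, the seed, and the 32 MB example range are assumptions made only for illustration, and the sketch generates offsets without issuing any I/O.

```python
import random

IO_SIZE = 4 * 1024     # 4 KB per request, as in the text
NUM_REQUESTS = 2048    # 2048 requests, as in the text

def random_write_offsets(address_range_bytes, seed=0):
    """Return 4 KB-aligned random offsets within [0, address_range_bytes)."""
    rng = random.Random(seed)                  # fixed seed for repeatability
    slots = address_range_bytes // IO_SIZE     # number of aligned positions
    return [rng.randrange(slots) * IO_SIZE for _ in range(NUM_REQUESTS)]

offsets = random_write_offsets(32 * 1024 * 1024)   # example: 32 MB range
total_mb = len(offsets) * IO_SIZE / (1024 * 1024)
print(f"{len(offsets)} requests, {total_mb:.0f} MB written in total")
```

The total volume stays fixed at 8 MB while only the address range varies, which isolates the effect of the write footprint on throughput.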

For Kingston, Wintec, A-Data, Patriot, and PNY, as 
the address range increases, the throughput drops signif- 
icantly and then stabilizes at the low level; for RiData, 
throughput drops but not as sharply, while for Transcend 
the throughput remains consistently high (we do not have
an explanation for the slight increase; multiple measurements
gave similar results). Sandisk exhibits more than
one regime change, dropping first around the 32 MB mark 
and then around the 1024 MB mark. 

Our surprising performance improvements can be explained as follows.
In a log-block FTL, a small number of physical blocks are
available for use as log blocks to stage an updated block;
a one-to-one correspondence exists between logical and
physical blocks. Since the amount of data written to one
disk in a 2-disk RAID-0 array is roughly half of the total,
the disk write footprint reduces and block address range
shrinks; the RAID scheme simply pushes the operating
regime of an SD card towards the left (in Figure 13), and depending on
the actual footprint, provides super-linear speedup! While
we came across this performance variation in the course of
our RAID experiments, the implications are more generic;
one can design other solutions centered around the compaction
of the write address range.


5.4 Using a Log-structured File System

Log-structured file systems provide good performance 
for random writes [58]; another solution to alleviate 
the effects of the random writes is thus to place the 
database files on a log-structured file system. We used 
the Nilfs2 [50] file system on Android since it works 
with block devices; we created a separate partition on the 
phone’s flash storage to store the entire SQLite database. 
Figure 12 (bars C) shows the benefits of log-structuring;
SQLite on Nilfs2 improves the performance of WebBench 
and Facebook by more than a factor of 4 for Kingston, and 
over 20% for RiData. 


5.5 Application Modifications

Finally, several solutions are possible if one is able to 
modify either the SQLite interface or the applications 
themselves. We demonstrate the benefits of such techniques
with a simple modification to SQLite: providing
the capability to perform selective sync operations based
on application-specific requirements; in our current implementation,
we simply turn off sync for the database
files that are deemed asynchronous as per our analysis (for
example, the WebView database file serving as the index
for the web cache). Figure 12 (bars D) compares the benefits
of the selective sync operation with the other
proposed solutions; selective sync provides noteworthy benefits,
especially for Facebook.
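
The idea of selective sync can be approximated from the application side with SQLite's stock `PRAGMA synchronous` setting, chosen per database file. The sketch below is only an illustration of that idea and not the modification to SQLite described above; the file names and the `open_db` helper are hypothetical.

```python
import os
import sqlite3
import tempfile

def open_db(path, durable):
    """Open a database and pick its sync policy (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # FULL syncs at every commit; OFF hands data to the OS without syncing.
    conn.execute("PRAGMA synchronous = FULL" if durable else "PRAGMA synchronous = OFF")
    return conn

workdir = tempfile.mkdtemp()
settings = open_db(os.path.join(workdir, "settings.db"), durable=True)
cache_index = open_db(os.path.join(workdir, "cache_index.db"), durable=False)

# SQLite reports the setting as an integer: 0 = OFF, 2 = FULL.
settings_mode = settings.execute("PRAGMA synchronous").fetchone()[0]
cache_mode = cache_index.execute("PRAGMA synchronous").fetchone()[0]
print("settings:", settings_mode, "cache index:", cache_mode)
```

Turning sync off trades durability after a crash or power loss for lower write latency, which is acceptable only for data that can be rebuilt, such as a cache index.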

Another potential technique to improve performance at 
the application level is through the use of larger transac- 
tions, amortizing the overhead of the SQLite sync inter- 
face. A careful restructuring of the application program- 
ming interface can perhaps lead to significant gains for 
future apps, but is beyond the scope of this paper; the interface
discussion is a classic chicken-and-egg problem in
the context of storage systems [54, 63]. Recently a new 
backend for SQLite has been proposed that uses write- 
ahead logging [18]; such techniques have the potential to 
ameliorate the random write bottleneck without requiring 
changes to the API. 
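
The effect of larger transactions can be sketched with Python's standard `sqlite3` module: grouping many inserts into one transaction means one commit, and hence one sync point, rather than one per row. The table name, schema, and URLs are hypothetical and serve only as an illustration.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "cache_index.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE cache_index (url TEXT PRIMARY KEY, path TEXT)")

rows = [(f"http://example.com/{i}", f"/cache/{i}") for i in range(100)]

# The connection context manager commits once when the block exits,
# so all 100 inserts share a single transaction.
with conn:
    conn.executemany("INSERT INTO cache_index VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM cache_index").fetchone()[0]
print(count, "rows inserted in one transaction")
```

Committing after every insert would instead pay the synchronous commit cost 100 times for the same data.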


5.6 Summary of Solutions 

Through our investigation of the solution space we notice 
several avenues for further performance improvements 
in the storage subsystem on mobile devices, and conse- 
quently the end-user experience. Our analysis reveals that 
a small amount of domain or application knowledge can 
improve performance in a more efficient way; through our 
pilot solutions we demonstrate the potential benefits of ex- 
plicit and implicit storage improvements. 

Programmers tend to heavily use the general-purpose
“all-synchronous” SQLite interface for its ease of use but
end up suffering from performance shortcomings. We
posit that a data-oriented I/O interface would be one that
enables the programmer to specify the I/O requirements
in terms of its reliability, consistency, and the property of
the data, i.e., temporary, permanent, or cache data, without
worrying about how it is stored underneath. For example,
a key-value store specifically for cache data does
not need to provide ultra-reliability; a web browser can 
use the cache key-value store as its web cache in a more 
performance-efficient manner than SQLite. 
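
One possible shape for such a data-oriented interface is sketched below. Every name here (`DataKind`, `IoPolicy`, `KeyValueStore`) is hypothetical, the store is an in-memory stand-in, and the sync counter merely marks where a real implementation would force data to stable storage.

```python
from dataclasses import dataclass
from enum import Enum

class DataKind(Enum):
    TEMPORARY = "temporary"
    PERMANENT = "permanent"
    CACHE = "cache"

@dataclass(frozen=True)
class IoPolicy:
    sync_on_write: bool    # force each write to stable storage?
    survive_reboot: bool   # must the data outlive a restart?

# Assumed mapping from the declared data property to an I/O policy.
POLICIES = {
    DataKind.TEMPORARY: IoPolicy(sync_on_write=False, survive_reboot=False),
    DataKind.CACHE: IoPolicy(sync_on_write=False, survive_reboot=True),
    DataKind.PERMANENT: IoPolicy(sync_on_write=True, survive_reboot=True),
}

class KeyValueStore:
    """Toy store that derives its sync behavior from the declared kind."""

    def __init__(self, kind):
        self.kind = kind
        self.policy = POLICIES[kind]
        self._data = {}
        self.sync_count = 0

    def put(self, key, value):
        self._data[key] = value
        if self.policy.sync_on_write:
            self.sync_count += 1   # stand-in for an fsync call

    def get(self, key, default=None):
        return self._data.get(key, default)

web_cache = KeyValueStore(DataKind.CACHE)
web_cache.put("http://example.com/logo.png", b"...")
print("syncs issued for cache data:", web_cache.sync_count)
```

The application states what the data is, and the store decides how much reliability to pay for, so cache writes never wait on a sync.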


6 Related Work 


We found little published literature on storage perfor- 
mance for mobile devices. One of the earliest works on 
storage for mobile computers [33] compares the perfor- 
mance of hard disks and flash storage on an HP Omni- 
Book; remarkably, many of their general observations are 
still valid. Datalight [11], provider of data management 
technologies for mobile and embedded devices to OEMs, 
makes an observation similar to ours with reference to its
proprietary Reliance Nitro file system. According to its
website, lack of device performance and responsiveness is
one of the important shortcomings of the [Windows] Mobile
platforms; OEMs using an optimized software stack
can improve performance. Our results also reaffirm some 
of the recent findings for desktop applications on the Mac 
OS X [42]: lack of pure sequential access for seemingly 
sequential application requests, heavy-handed use of syn- 
chronization primitives, and the influence of underlying 
libraries on application I/O. 

A recent study of web browsers on smartphones [67] 
examined the reasons behind slow web browsing per- 
formance and found that optimizations centering around 
compute-intensive operations provide only marginal im- 
provements; instead “resource loading” (e.g., files of vari- 
ous types being fetched from the webserver) contributes 
most to browser delay. While this work focuses more 
specifically on the browser and the network, it reaffirms 
the observation that improvements in the OS and hard- 
ware are needed to improve application performance. 

Other related work has focused on the implications 
of network performance on smartphone applications [44] 
and on the diversity of smartphone usage [35]. Finally, 
there is extensive work in developing smarter, richer, and 
more powerful applications for mobile devices, far too 
much to cite here. We believe the needs of these appli- 
cations are in turn going to drive the performance require- 
ments expected of hardware devices, including storage, as 
well as the operating system software. 


7 Conclusions 


Contrary to conventional wisdom, we find evidence that 
storage is a significant contributor to application perfor- 
mance on mobile devices; our experiments provide insight 
into the Android storage stack and reveal its correlation 
with application performance. Surprisingly, we find that 
even for an interactive application such as web browsing, 
storage can affect the performance in non-trivial ways; for 
I/O intensive applications, the effects can get much more 
pronounced. With the advent of faster networks and I/O 
interconnects on the one hand, and a more diverse, pow- 
erful set of mobile apps on the other, the performance re- 
quired from storage 1s going to increase in the future. We 
believe the storage system on mobile devices needs a fresh 
look and we have taken the first steps in this direction. 
Serving Large-scale Batch Computed Data with Project Voldemort


Roshan Sumbaly Jay Kreps Lei Gao Alex Feinberg Chinmay Soman Sam Shah
LinkedIn 
Abstract


Current serving systems lack the ability to bulk load
massive immutable data sets without affecting serving
performance. The performance degradation is largely due
to index creation and modification as CPU and memory
resources are shared with request serving. We have extended
Project Voldemort, a general-purpose distributed
storage and serving system inspired by Amazon’s Dynamo,
to support bulk loading terabytes of read-only data.
This extension constructs the index offline, by leveraging
the fault tolerance and parallelism of Hadoop. Compared
to MySQL, our compact storage format and data deployment
pipeline scale to twice the request throughput while
maintaining sub-5 ms median latency. At LinkedIn, the
largest professional social network, this system has been
running in production for more than 2 years and serves
many of the data-intensive social features on the site.


1 Introduction


Many social networking and e-commerce web sites contain
data-derived features, which usually consist of some
data mining application offering insights to the user. Typical
features include: “People You May Know,” a link prediction
system attempting to find other users you might
know on the social network (Figure 1a); collaborative
filtering, which showcases relationships between pairs
of items based on the wisdom of the crowd (Figure 1b);
various entity recommendations; and more. LinkedIn, the
largest professional social network with, as of writing,
more than 135 million members, consists of these and
more than 20 other data-derived features.

The feature data cycle in this context consists of a continuous
chain of three phases: data collection, processing,
and serving. The data collection phase usually involves
log consumption, while the processing phase involves
running algorithms on the output. Algorithms such as
link prediction or nearest-neighbor computation output
hundreds of results per user. For example, the “People
You May Know” feature on LinkedIn runs on hundreds of
terabytes of offline data daily to make these predictions.

Due to the dynamic nature of the social graph, this 
derived data changes extremely frequently—requiring 
an almost complete refresh and bulk load of the data, 
while continuing to serve existing traffic with minimal 
additional latency. Naturally, this batch update should 
complete quickly to engender frequent pushes. 

Interestingly, the nature of this complete cycle means 
that live updates are not necessary and are usually handled 
by auxiliary data structures. In the collaborative filtering 
use case, the data is purely static. In the case of “People
You May Know”, dismissed recommendations (marked
by clicking “X”) are stored in a separate data store with
the difference between the computed recommendations 
and these dismissals calculated at page load. 
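
The page-load step described above amounts to a set difference between the batch-computed list and the dismissals. A minimal sketch follows, with hypothetical member identifiers and a hypothetical function name.

```python
def visible_recommendations(computed, dismissed):
    """Drop dismissed members while keeping the batch-computed ranking order."""
    dismissed_set = set(dismissed)   # O(1) membership checks
    return [member for member in computed if member not in dismissed_set]

# computed: read from the read-only store; dismissed: read from the
# separate live store that records clicks on "X".
computed = ["member_17", "member_42", "member_8", "member_99"]
dismissed = ["member_42", "member_5"]
print(visible_recommendations(computed, dismissed))
```

Because the dismissals live in a separate, small, writable store, the large computed data set can remain immutable between batch refreshes.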

This paper presents read-only extensions to Project 
Voldemort, our key-value solution for the final serving 
phase of this cycle and discusses how it fits into our fea- 
ture ecosystem. Voldemort, which was inspired by Ama- 
zon’s Dynamo [7], was originally designed to support 
fast online read-writes. Our system leverages a Hadoop 
elastic batch computing infrastructure to build its index 
and data files, thereby supporting high throughput for 
batch refreshes. A custom read-only storage engine plugs 
into Voldemort’s extensible storage layer. The Voldemort 
infrastructure then provides excellent live serving perfor- 
mance for this batch output—even during data refreshes. 

Our system supports quick rollback, where data can 
be restored to a clean copy, minimizing the time in error 
if an algorithm should go awry. This helps support fast, 
iterative development necessary for new feature improve- 
ments. The storage data layout also provides the ability 
to grow horizontally by rebalancing existing data to new 
nodes without downtime. 

Our system supports twice the request throughput ver- 
sus MySQL while serving read requests with a median 
latency of less than 5 ms. At LinkedIn, this system has 
been running for over 2 years, with one of our largest 
Figure 1: (a) The “People You May Know” module. (b) An example
collaborative filtering module. 


clusters loading more than 4 terabytes of new data to the 
site every day. 


The key contributions of this work are: 


• A scalable offline index construction, based on
MapReduce [6], which produces partitioned data
for online consumption

• Complete data cycle to refresh terabytes of data with
minimum effect on existing serving latency

• Custom storage format for static data, which leverages
the operating system’s page cache for cache
management


Voldemort and its read-only extensions are open source 
and are freely available under the Apache 2.0 license. 


The rest of the paper is organized as follows. Section 2 first
discusses related work. We then provide an architectural 
overview of Voldemort in Section 3. We follow with 
a discussion in Section 4 of existing solutions that we 
tried, but found insufficient for bulk loading and serving 
largely static data. Section 5 describes Voldemort’s read- 
only extensions, including our new storage format and 
how data and indexes are built offline and loaded into the 
system. Section 6 presents experimental and production 
results evaluating our solution. We close with a discussion 
of future directions in Section 7. 


2 Related Work


MySQL [16] is a common serving system used in various 
companies. The two most commonly used MySQL storage
engines, MyISAM and InnoDB, provide bulk loading
capabilities into a live system with the LOAD DATA
INFILE statement. MyISAM provides a compact on-disk
structure and the ability to delay recreation of the
index after the load. However, these benefits come at the 
expense of requiring considerable memory to maintain a 
special tree-like cache during bulk loading. Additionally, 
the MyISAM storage engine locks the complete table for 
the duration of the load, resulting in queued requests. In 
comparison, InnoDB supports row-level locking, but its 
on-disk structure requires considerable disk space and 
its bulk loading is orders of magnitude slower than My- 
ISAM. 

Considerable work has been done to add bulk loading 
ability to new shared nothing [22] cluster databases sim- 
ilar to Voldemort. Silberstein et al. [19] introduce the 
problem of bulk insertion into range-partitioned tables 
in PNUTS [4], which tries to optimize data movement 
between machines and total transfer time by adding an 
extra planning phase to gather statistics and prepare the 
system for the incoming workload. In an extension of 
that work [20], Hadoop is used to batch insert data into 
PNUTS in the reduce phase. Both of these approaches 
optimize the time for data loading into the live system, 
but incur latency degradation on live serving due to multi- 
tenant issues with sharing CPU and memory during the 
full loading process. This is a significant problem with 
very large data sets, which even after optimizations, might 
take hours to load. 

Our system alleviates this problem by moving the con- 
struction of the indexes to an offline system. MapRe- 
duce [6] has been used for this offline construction in 
various search systems [14]. These search layers trigger 
builds on Hadoop to generate indexes, and on completion, 
pull the indexes to serve search requests. 

This approach has also been extended to various 
databases. Konstantinou et al. [10] and Barbuzzi et al. [2] 
suggest building HFiles offline in Hadoop, then shipping 
them to HBase [9], an open source database modeled af- 
ter BigTable [3]. These works do not explore the data 
pipeline, particularly data refreshes and rollback. 

The overall architecture of Voldemort was inspired
by various DHT storage systems. Unlike previous
DHT systems, such as Chord [21], which provide
O(log N) lookup, Voldemort’s lookups are O(1) because
the complete cluster topology is stored on every
node. This information allows clients to bootstrap from 
a random node and direct requests to exact destination 
nodes. Similar to Dynamo [7], Voldemort also supports
per-tuple replication for availability purposes. Updating
replicas is easy in the batch scenario because they
are precomputed and loaded into the Voldemort cluster at
once. The novelty of Voldemort, compared to Dynamo, 
is our custom storage engine for bulk-loaded data sets. 


3 Project Voldemort 


A Voldemort cluster can contain multiple nodes, each 
with a unique identifier. A physical host can run multi- 
ple nodes, though at LinkedIn we maintain a one-to-one 
mapping. All nodes in the cluster have the same number 
of stores, which correspond to database tables. General 
usage patterns have shown that a site-facing feature can 
map to one or more stores. For example, a feature dealing 
with group recommendations will map to two stores: one 
recording a member id to recommended group ids and 
another recording a group id to its corresponding descrip- 
tion. Every store has the following list of configurable 
parameters, which are identical to Dynamo’s parameters: 

• Replication factor (N): Number of nodes to which
each key-value tuple is replicated.

• Required reads (R): Number of nodes Voldemort
reads from, in parallel, during a get before declaring
success.

• Required writes (W): Number of node responses
Voldemort blocks for before declaring success during
a put.

• Key/Value serialization and compression: Voldemort
can have different serialization schemas for key and
value. For the custom batch data use case, Voldemort
uses a custom binary JSON format. Voldemort also
supports per-tuple compression. Serialization
and compression are handled entirely by a component
that resides on the client side, with the server
dealing only with raw byte arrays.

• Storage engine type: Voldemort supports various
read-write storage engine formats: Berkeley DB
Java Edition [15] and MySQL [16]. Voldemort also
supports a custom read-only storage engine for bulk-loaded
data.

Every node in the cluster stores the same 2 pieces of 
metadata: the complete cluster topology and the store 
definitions. 

Voldemort has a pluggable architecture, as shown in 
Figure 2. Each box represents a module, all of which 
share the same code interface. Each module has exactly 
one functionality, making it easy to interchange modules. 
For example, we can have the routing module on either 
the client side or the server side. Functional separation 
at the module level also allows us to easily mock these 
modules for testing purposes—for example, a mocked up 
storage engine backed by a hash map for unit tests. 

Many of our modules have been inspired by the original 
Dynamo paper. Starting from the top of the Voldemort 
stack, our client has a simple get and put API. Every 
tuple is replicated for availability, with each value having 
Figure 2: Voldemort architecture containing modules for a single client
and server. The dotted modules are not used by the read-only storage
engine.

Figure 3: Simple hash ring cluster topology for 3 nodes and 12 partitions.
The preference list generation for a key hashing to partition
11, for a store with N=2, would jump the ring clockwise to place the
other N-1=1 replica on partition 0. The table shows the preference
list generated for every hashed partition. The primary partitions have
been highlighted in bold.




vector clock [11] versioning. The “conflict resolution”
and “repair mechanism” layers, used only by the read-write
storage engines, deal with inconsistent replicas. This does
not apply to read-only stores because Voldemort updates 
all the replicas of a key in a store at once, keeping them 
in sync. 

The “routing” module deals with partitioning as well 
as replication. Our partitioning scheme is similar to Dy- 
namo’s, wherein Voldemort splits the hash ring into equal 
size partitions, assigns them unique ids, and then maps 
them to nodes. This ring is then shared with all the stores; 
that is, changes in the mapping require changes to all the 
stores. To generate the preference list (the list of partition 
ids where the replicas will be stored), we first hash the key 
(using MD5) to a range belonging to a partition and then
continue jumping the ring clockwise to find N-1 partitions
belonging to different nodes. For example, for a
store with N=2 and partition mapping as shown in Figure 3,
the preference list for a key hashing to partition 11
will be (Partition 11, Partition 0).
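The preference list walk can be sketched in a few lines of Python. This is only an illustration: the round-robin partition-to-node layout is inferred from Table 1, the reduction of the MD5 digest modulo the partition count is an assumption about how a key lands on a partition, and all function names are hypothetical.

```python
import hashlib

NUM_PARTITIONS = 12
NUM_NODES = 3


def partition_to_node(partition_id: int) -> int:
    # Assumed layout: partitions are dealt round-robin to the 3 nodes.
    return partition_id % NUM_NODES


def primary_partition(key: bytes) -> int:
    # Assumption: the MD5 digest is read as an integer and reduced
    # modulo the number of partitions.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest, "big") % NUM_PARTITIONS


def preference_list(primary: int, replication_factor: int) -> list[int]:
    """Walk the ring clockwise, keeping partitions owned by unseen nodes."""
    partitions = [primary]
    nodes_used = {partition_to_node(primary)}
    candidate = primary
    while len(partitions) < replication_factor:
        candidate = (candidate + 1) % NUM_PARTITIONS
        if candidate == primary:
            raise ValueError("not enough distinct nodes for this replication factor")
        node = partition_to_node(candidate)
        if node not in nodes_used:
            partitions.append(candidate)
            nodes_used.add(node)
    return partitions


print(preference_list(11, 2))  # the example from the text: [11, 0]
```

Because every client holds the full topology, this computation needs no network hop, which is what makes lookups O(1).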


The last module, the pluggable storage layer, has the 
same get and put functions, along with the ability to 
stream data out. In addition to running the full stack 
from Figure 2, every node also runs an administrative 
service that allows the execution of the following privileged
commands: add or remove a store, stream data out, and
trigger read-only store operations. 


Voldemort supports two routing modes: server-side and 
client-side routing. Client-side routing, the more com- 
monly used routing strategy, requires an initial “bootstrap” 
step, wherein it retrieves the metadata required for routing 
(cluster topology and store definitions) by load balancing 
to a random node. Once the metadata has been retrieved 
by the client, one fewer hop is necessary compared to 
server-side routing, because the replica locations can be 
calculated on the fly. However, as we will further explain 
in Section 5.7, client-side routing makes rebalancing of 
data complicated, because we now need a mechanism to 
update the cluster topology metadata on the live clients. 


4 Alternative Approaches 


Before we started building our own custom storage engine, 
we decided to evaluate the existing read-write storage 
engines supported in Voldemort, namely, MySQL and 
Berkeley DB. Our criterion for success was the ability
to bulk load massive data sets with minimal disk space 
overhead, while still serving live traffic. 


4.1 Shortcomings of Alternative Approaches 


The first approach we tried was to perform multiple put 
requests. This naive approach is problematic as every 
request results in an incremental change to the underly- 
ing index structure (in most cases, a B+ tree), which in 
turn, results in many disk seeks. To solve this problem, 
MySQL provides a LOAD DATA statement that tries to 
bulk update the underlying index. Unfortunately, using 
this statement for the MyISAM storage engine locks the 
entire table. InnoDB instead executes this statement with 
row-level locking, but experiences substantial disk space 
overhead for every tuple. However, to achieve MyISAM- 
like bulk loading performance, InnoDB prefers data or- 
dered by primary key. Achieving fast load times with low 
space overhead in Berkeley DB requires several manual 
and non-scalable configuration changes, such as shutting 
down cleaner and checkpointer threads. 


The next solution we explored was to bulk load into 
a different MySQL table on the same cluster and use 
views to transparently swap to the new table. We used 
the MyISAM storage engine, opting to skip InnoDB due 
to the large space requirements. This approach solves 
the locking problem, but still hurts serving latency during 
the load due to pressure on shared CPU and memory
resources. 

We then tried completely offloading the index construc- 
tion to another system as building the index on the serving 
system has isolation problems. We leveraged the fact that 
MyISAM allows copying of database files from another 
node into a live database directory, automatically making 
it available for serving. We bulk load to a separate cluster 
and then copy the resulting database files over to the live 
cluster. This two-step approach requires the extra main- 
tenance cost of a separate MySQL cluster with exactly 
the same number of nodes as the live one. Additionally, 
the inability to load compressed data in the bulk load 
phase means data is copied multiple times between nodes: 
first, as a flat file to the bulk load cluster; then as an in- 
ternal copy during the LOAD statement; and finally, as a 
raw database file copy to the actual live database. These 
copies make the load more time-consuming. 

The previous solution was not ideal, due to its depen- 
dency on redundant MySQL servers and the resulting 
vulnerability to failure downtime. To address this short- 
coming, the next attempted approach used the inherent 
fault tolerance and parallelism of Hadoop and built in- 
dividual node/partition-level data stores, which could be 
transferred to Voldemort for serving. A Hadoop job reads 
data from a source in HDFS [18], repartitions it on a 
per-node basis, and finally writes the data to individual 
storage engines (for example, Berkeley DB) on the local 
filesystem of the reducer phase Hadoop nodes. The num- 
ber of reducers equals the number of Voldemort nodes, 
but could have easily been further split on a per-partition 
basis. This data is then read from the local filesystem and 
copied onto HDFS, where it can be fetched by Voldemort. 
The benefit of this approach is that it leverages Hadoop’s 
parallelism to build the indexes offline; however, it suf- 
fers from an extra copy from the local filesystem on the 
reducer nodes to HDFS, which can become a bottleneck
with terabytes of data. 


4.2 Requirements 


The lack of off-the-shelf solutions, along with the in- 
efficiencies of the previous experiments, motivated the 
building of a new storage engine and deployment pipeline 
with the following properties. 


• Minimal performance impact on live requests: The
incoming get requests to the live store must not be
impacted during the bulk load. There is a trade-off
between modifying the current index on the live
server and a fast bulk load: quicker bulk loads result
in increased I/O, which in turn hurts performance.
As a result, we should completely rebuild the index
offline and also throttle fetches to Voldemort.

• Fault tolerance and scalability: Every step of the
data load pipeline should handle failures and also
scale horizontally to support future expansion without
downtime.

• Rollback capability: The general trend we notice in
our business is that incorrect or incomplete data due
to algorithm changes or source data problems needs
immediate remediation. In such scenarios, running
a long batch load job to repopulate correct data is
not acceptable. To minimize the time in error, our
storage engine must support very fast rollback to a
previous good state.

• Ability to handle large data sets: The easy access
to scalable computing through Hadoop, along with
the growing use of complex algorithms, has resulted
in large data sets being used as part of many core
products. Classic examples of this, in the context
of social networks, include storing relationships between
a pair of users, or between users and an entity.
When dealing with millions of users, these pairs can
easily reach billions of tuples, motivating our storage
engine to support terabytes of data and perform well
under a large data to memory ratio.


5 Read-only Extensions 


To satisfy the requirements laid out in Section 4.2, we 
built a new data deployment pipeline as shown in Fig- 
ure 4. We use the existing Voldemort architecture to plug 
in a new storage engine with a compact custom format 
(Section 5.1). For many of LinkedIn’s user-facing fea- 
tures, data is generated by algorithms run on Hadoop. For 
example, the “People You May Know” feature runs a com- 
plex series of Hadoop jobs on log data. We thus leverage 
Hadoop as the computation layer for building the index as 
its MapReduce component handles failures while HDFS 
replication provides availability. After the algorithm’s 
computation completes, a driver program coordinates a 
refresh of the data. As shown in steps 1 and 2 in Figure 4,
it triggers a build of the output data in our custom storage 
format and stores it on HDFS (Section 5.2). This data is 
kept in versioned format (Section 5.3) after being fetched 
by Voldemort nodes in parallel (Section 5.4), as demon- 
strated in steps 3 and 4. Once fetched and swapped in, 
as displayed in steps 5 and 6, the data on the Voldemort 
nodes is ready for serving (Section 5.5). This section 
describes this procedure in detail. We also discuss real 
world production scenarios such as data schema changes 
(Section 5.6) and the no-downtime addition of new nodes 
(Section 5.7). 


5.1 Storage Format 


Many storage formats try to build data structures that 
keep the data memory resident in the process’s address 
space, ignoring the effects of the operating system’s page 
cache. The several orders of magnitude latency gap be- 
tween page cache and disk means that most of the real 
Figure 4: Steps involved in the complete data deployment pipeline. The
components involved include Hadoop, HDFS, Voldemort, and a driver
program coordinating the full process. The “build” step works on the
output of the algorithm’s job pipeline.

Figure 5: Read-only data is split into multiple chunk buckets, each of
which is further split into multiple chunk sets. A chunk set contains an
index file and a data file. The diagram shows the data layout in these
files. The numbers at the top are sizes in bytes.


performance benefit by maintaining our own structure is 
for elements already in the page cache. In fact, this cus- 
tom structure may even start taking memory away from 
the page cache. This potential interference motivated the 
need for our storage engine to exploit the page cache in- 
stead of maintaining our own complex heap-based data 
structure. Because our data is immutable, Voldemort 
memory maps the entire index into the address space. Ad- 
ditionally, because Voldemort is written in Java and runs 
on the JVM, delegating the memory management to the 
operating system eases garbage collection tuning. 

To take advantage of the parallelism in Hadoop during 
generation, we split the input data destined for a particular 
node into multiple chunk buckets, which in turn are split 
into multiple chunk sets. Generation of multiple chunk 
Node Id   Chunk buckets
0         0_0, 3_0, 6_0, 9_0, 2_1, 5_1, 8_1, 11_1
1         1_0, 4_0, 7_0, 10_0, 0_1, 3_1, 6_1, 9_1
2         2_0, 5_0, 8_0, 11_0, 1_1, 4_1, 7_1, 10_1

Table 1: Every Voldemort node is responsible for chunk buckets based
on the primary partition and replica id. This table shows the node id to
chunk bucket mapping for the cluster topology defined in Figure 3.


sets can then be done independently and in parallel. A 
chunk bucket is defined by the primary partition id and 
replica id, thereby giving it a unique identifier across all 
nodes. For a store with N=2, the replica id would be 
either 0 for the primary replica or 1 for the secondary
replica. For example, the hashed key in Figure 3 would
fall into buckets 11_0 (on node 2) and 11_1 (on node 0).
Table 1 summarizes the various chunk buckets for a store
with N=2 and cluster topology as shown in Figure 3. Our
initial design had one chunk bucket per node (that is,
multiple chunk sets stored on a node with no knowledge of
partition/replica), but the current smaller granularity is
necessary to aid in rebalancing (Section 5.7).
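The node to chunk bucket mapping of Table 1 can be derived mechanically from the topology. The Python sketch below assumes the round-robin partition layout implied by Table 1 (partition p lives on node p mod 3); the helper names are hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS, NUM_NODES, REPLICATION_FACTOR = 12, 3, 2


def partition_to_node(partition_id: int) -> int:
    # Assumed layout: partitions are dealt round-robin to the nodes.
    return partition_id % NUM_NODES


def preference_list(primary: int, n: int) -> list[int]:
    # Clockwise walk that keeps only partitions on nodes not yet used.
    chosen, nodes = [primary], {partition_to_node(primary)}
    step = 1
    while len(chosen) < n and step < NUM_PARTITIONS:
        candidate = (primary + step) % NUM_PARTITIONS
        if partition_to_node(candidate) not in nodes:
            chosen.append(candidate)
            nodes.add(partition_to_node(candidate))
        step += 1
    return chosen


def chunk_buckets_by_node() -> dict[int, list[str]]:
    """Bucket '<primary>_<replica>' lives on the node holding that replica."""
    buckets = defaultdict(list)
    for replica_id in range(REPLICATION_FACTOR):
        for primary in range(NUM_PARTITIONS):
            holder = preference_list(primary, REPLICATION_FACTOR)[replica_id]
            buckets[partition_to_node(holder)].append(f"{primary}_{replica_id}")
    return dict(buckets)


for node_id, names in sorted(chunk_buckets_by_node().items()):
    print(node_id, ", ".join(names))
```

Note that a bucket is always named after the primary partition of the key, even when the replica itself is stored on a different partition's node.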

The number of chunk sets per bucket is decided dur- 
ing generation on the Hadoop side. The default value 
is one chunk set per bucket, but can be increased by 
the store owner for more parallelism. The only limitation
is that a very large value for this parameter
would result in many small files, a scenario
that HDFS does not handle efficiently. As shown in
Figure 5, a chunk set includes a data file and an index
file. The standard naming convention for all our chunk
sets is partition id_replica id_chunk set id.{data, index},
where partition id is the id of the primary partition, replica
id is a number between 0 and N-1, and chunk set id is a
number between 0 and the predefined number of sets per
bucket minus 1.

The index file is a compact structure containing the 
sorted upper 8 bytes of the MD5 of the key followed by 
the 4 byte offset of the corresponding value in the data 
file. This simple sorted structure allows us to leverage 
Hadoop’s ability to return sorted data in the reducers. Fur- 
ther, preliminary tests also showed that the index files 
were generally orders of magnitude smaller than the data 
files and hence, could fit into the page cache. The use 
of MD5, instead of any other hash function yielding uni- 
formly distributed values, was an optimization to reuse 
the calculation from the generation of the preference list. 

We initially used the full 16 bytes of
the MD5 signature, but saw performance problems as
the number of stores grew. In particular, the indexes for 
all stores were not page cache resident, and thrashing 
behavior was seen for certain stores due to other high- 
throughput stores. To alleviate this problem, we needed 
to cut down on the amount of data being memory mapped, 
which could be achieved by reducing the available key-space
and accepting collisions in the data file.

Our optimization to decrease key-space can be mapped
to the classic birthday paradox: if we want to retrieve
n random integers from a uniform distribution of range
[1, x], the probability that at least 2 numbers are the same
is:

$$1 - e^{-\frac{n(n-1)}{2x}} \qquad (1)$$

Mapping this to our common scenario of stores keyed by
member id, n is our 135 million member user base, while
the initial value of x is $2^{128} - 1$ (16 bytes of MD5). The
probability of collision in this scenario is close to 0. A
key-space of 4 bytes (that is, 32 bits) yields an extremely
high collision probability of:

$$1 - e^{-\frac{135 \cdot 10^{6} \, (135 \cdot 10^{6} - 1)}{2 \, (2^{32} - 1)}} \approx 1 \qquad (2)$$

Instead, a compromise of 8 bytes (that is, 64 bits) produces:

$$1 - e^{-\frac{135 \cdot 10^{6} \, (135 \cdot 10^{6} - 1)}{2 \, (2^{64} - 1)}} < 0.0004 \qquad (3)$$

The probability of more than one collision is even smaller.
As a result, by decreasing the number of bytes of the
MD5 of the key, we were able to cut down the index size
by 40%, allowing more stores to fit into the page cache.
The key-space size is an optional parameter the store 
owner can set depending on the semantics of the data. 
Unfortunately, this optimization came at the expense of 
having to save the keys in the data file to use for lookups 
and handle rare collisions. 
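The trade-off can be checked numerically. The Python sketch below evaluates the birthday approximation of equation (1) for the three key-space sizes and recomputes the index saving from the entry layout (key prefix plus 4-byte offset). Function names are hypothetical. With these exact inputs the 8-byte case evaluates to a few times 10^-4, the same order of magnitude as the bound quoted in equation (3).

```python
import math


def collision_probability(n: int, key_bytes: int) -> float:
    """Birthday approximation: 1 - exp(-n(n-1) / (2x)), with x = 2^(8*bytes) - 1."""
    x = 2 ** (8 * key_bytes) - 1
    # expm1 keeps precision when the exponent is tiny (16- and 8-byte cases).
    return -math.expm1(-n * (n - 1) / (2 * x))


def index_entry_size(key_bytes: int, offset_bytes: int = 4) -> int:
    # One index entry is the truncated MD5 followed by the data file offset.
    return key_bytes + offset_bytes


members = 135_000_000
for key_bytes in (16, 8, 4):
    print(f"{key_bytes:2d} bytes: {collision_probability(members, key_bytes):.3e}")

saving = 1 - index_entry_size(8) / index_entry_size(16)
print(f"index size reduction: {saving:.0%}")
```

The 40% figure follows directly from the entry size shrinking from 20 bytes to 12 bytes.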

The data file is also a highly packed structure
where we store the number of collided tuples followed by 
a list of collided tuples (key size, value size, key, value). 
The order of these multiple lists is the same as the corre- 
sponding 8 bytes of MD5 of key in the index file. Here, 
we need to store the key bytes instead of the MD5 in the 
tuples to distinguish collided tuples during reads. 
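To make the two file layouts concrete, the following Python sketch builds an index file and a data file in memory for a handful of records. The field widths follow the Size(Short) and Size(2 * Integer) terms in the Figure 6 pseudo-code, taken here as a 2-byte tuple count and 4-byte key and value sizes. Big-endian encoding and unsigned lexicographic ordering of the MD5 prefixes are assumptions, and `build_chunk_set` is a hypothetical name.

```python
import hashlib
import struct
from collections import defaultdict


def build_chunk_set(records: dict[bytes, bytes], key_bytes: int = 8):
    """Return (index_bytes, data_bytes) for one chunk set."""
    # Group records by truncated MD5 so colliding keys share one index entry.
    groups = defaultdict(list)
    for key, value in records.items():
        groups[hashlib.md5(key).digest()[:key_bytes]].append((key, value))

    index, data = bytearray(), bytearray()
    for prefix in sorted(groups):
        # Index entry: MD5 prefix followed by the 4-byte offset into the data file.
        index += prefix + struct.pack(">I", len(data))
        tuples = groups[prefix]
        # Data entry: number of collided tuples, then each tuple as
        # (key size, value size, key, value).
        data += struct.pack(">H", len(tuples))
        for key, value in tuples:
            data += struct.pack(">II", len(key), len(value)) + key + value
    return bytes(index), bytes(data)


index, data = build_chunk_set({b"member:1": b"a", b"member:2": b"bb"})
print(len(index), len(data))
```

Storing the full key next to each value is what lets a reader tell collided tuples apart after the index has narrowed the search to one offset.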


5.2 Chunk Set Generation 


Construction of the chunk sets for all the Voldemort nodes 
is a single Hadoop job; the pseudo-code representation 
is shown in Figure 6. The Hadoop job takes as its input
the number of chunk sets per bucket, cluster topology, 
store definition, and the input data location on HDFS. The 
job then takes care of replication and partitioning, finally 
saving the data into separate node-based directories. 

At a high level, the mapper phase deals with the parti- 
tioning of the data depending on the routing strategy; the 
partitioner phase redirects the key to the correct reducer 
and the reducer phase deals with writing the data to a sin- 
gle chunk set. Due to Hadoop’s generic InputFormat 
mechanics, any source data can be converted to Volde- 
mort’s serialization format. The mapper phase emits the 
Global Input: Num Chunk Sets: Number of chunk sets per bucket
Global Input: Replication Factor: Tuple replicas for the store
Global Input: Cluster: The cluster topology
Function: TopBytes(x, n): Returns top n bytes of x
Function: MD5(x): Returns MD5 of x
Function: PreferenceList(x): Partition list for key x
Function: Size(x): Return size in bytes
Function: Make*(x): Convert x to Voldemort serialization

Input: K/V: Key/value from HDFS files
Data: K’/V’: Transformed key/value into Voldemort serialization
map (K, V)
    K’ ← MakeKey(K)
    V’ ← MakeValue(V)
    Replica Id ← 0
    MD5K’ ← MD5(K’)
    KOut ← TopBytes(MD5K’, 8)
    foreach Partition Id ∈ PreferenceList(MD5K’) do
        Node Id ← PartitionToNode(Partition Id)
        emit(KOut, [Node Id, Partition Id, Replica Id, K’, V’])
        Replica Id ← Replica Id + 1
    end
end

Input: K: Top 8 bytes of MD5 of Voldemort key
Input: V: [Node Id, Partition Id, Replica Id, K’, V’]
partition (K, V): Integer
    Chunk Set Id ← TopBytes(MD5(V.K’), Size(Integer)) % Num Chunk Sets
    Bucket Id ← V.Partition Id * Replication Factor + V.Replica Id
    return Bucket Id * Num Chunk Sets + Chunk Set Id
end

Input: K/V: Same as partitioner
Data: Position: Continuous offset into data file. Initialized to 0
reduce (K, Iterator<V> Values)
    WriteIndexFile(K)
    WriteIndexFile(Position)
    WriteDataFile(Values.length)
    Position += Size(Short)
    foreach V ∈ Values do
        WriteDataFile(Size(V.K’))
        WriteDataFile(Size(V.V’))
        WriteDataFile(V.K’)
        WriteDataFile(V.V’)
        Position += Size(V.K’) + Size(V.V’) + Size(2 * Integer)
    end
end

Figure 6: MapReduce pseudo-code used for chunk set generation.


upper 8 bytes of the MD5 of the Voldemort key N times 
as the map phase key with the map phase value equal to a 
grouped tuple of node id, partition id, replica id, and the 
Voldemort key and value. 


The custom partitioner generates the chunk set id 
within a chunk bucket from this key. Due to the fair 
distribution of MD5, we partition the data destined for 
a bucket into sets by taking the top 4 bytes of MD5 modulo
the predefined number of chunk sets per bucket. This 
generated chunk set id, along with the partition id and 
replication factor of the store, is used to route the data 
further to the correct reducer. 
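The arithmetic of the partitioner in Figure 6 can be written out as a short Python sketch. Reading the top 4 bytes of the MD5 as an unsigned big-endian integer is an assumption, and `reducer_id` is a hypothetical name.

```python
import hashlib


def reducer_id(key: bytes, partition_id: int, replica_id: int,
               replication_factor: int, num_chunk_sets: int) -> int:
    """Map one emitted tuple to the reducer that writes its chunk set."""
    # Chunk set within the bucket: top 4 bytes of MD5 modulo the set count.
    top4 = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    chunk_set_id = top4 % num_chunk_sets
    # Bucket: one per (partition, replica) pair.
    bucket_id = partition_id * replication_factor + replica_id
    return bucket_id * num_chunk_sets + chunk_set_id


# 12 partitions, N=2 and 4 chunk sets per bucket give 12 * 2 * 4 = 96 reducers.
print(reducer_id(b"member:42", partition_id=11, replica_id=1,
                 replication_factor=2, num_chunk_sets=4))
```

Each bucket owns a contiguous range of reducer ids, so raising the number of chunk sets per bucket raises build parallelism without changing which node the data is destined for.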


Finally, every reducer is responsible for a single chunk 
set, meaning that by having more chunk sets, build phase 


parallelism can be increased. Hadoop automatically sorts 
input based on the key in the reduce phase, so data arrives 
in the order necessary for the index and data files, which 
can be constructed as simple appends on HDFS with no 
extra processing required. The data layout on HDFS is a 
directory for each Voldemort node, with the nomenclature 
of node-id. 


5.3 Data Versioning


Before we describe how the generated chunk set files 
are copied from HDFS, it is essential to understand their 
storage layout on the Voldemort nodes. This layout is 
crucial because one of our requirements is the ability to 
perform instantaneous rollback of data. That is, every 
time a new copy of the complete data set is created, the 
system needs to demote the previous copy to an earlier
state.

Every store is represented by a directory, which in turn 
contains directories corresponding to “versions” of the 
data. A symbolic link per store is used to point to the 
current serving version directory. Because the data in 
all version directories except the serving one is inactive, 
we are not affecting page cache usage and latency. Also, 
with disks becoming cheaper and providing very fast se- 
quential writes compared to random reads, keeping these 
previous copies (the number of which is configurable) 
is beneficial for quick rollback. Every version directory 
(named version-no) has a configurable number associated
with it, which should monotonically increase
with every new fetch. A commonly used example for the 
version number is the timestamp of push. 

Swapping in a new data version on a single node is 
done as follows: copy into a new version directory, close 
the current set of active chunk set files, open the chunk set 
files from the new version, memory map all the index files, 
and change the symbolic link to the new version. The 
entire operation is coordinated using a read-write lock. A 
rollback follows the same sequence of steps, except that 
files are opened in an older version directory. Both of 
these operations are very fast as they are purely metadata 
operations: no data reads take place. 
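The symbolic link flip at the heart of swap and rollback can be sketched in Python as follows. This covers only the link update, not the closing, opening, and memory mapping of chunk set files or the read-write lock described above. The link name `latest` and the function names are hypothetical, and the sketch assumes a POSIX filesystem where renaming over an existing link is atomic.

```python
import os
import tempfile


def swap(store_dir: str, version: int, link_name: str = "latest") -> None:
    """Point the store's serving link at version-<version>."""
    target = f"version-{version}"
    if not os.path.isdir(os.path.join(store_dir, target)):
        raise FileNotFoundError(target)
    link = os.path.join(store_dir, link_name)
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    # Create the new link under a temporary name, then rename it over the
    # old one so readers never observe a missing link.
    os.symlink(target, tmp)
    os.replace(tmp, link)


def current_version(store_dir: str, link_name: str = "latest") -> int:
    return int(os.readlink(os.path.join(store_dir, link_name)).rsplit("-", 1)[1])


with tempfile.TemporaryDirectory() as store:
    for v in (1, 2):
        os.mkdir(os.path.join(store, f"version-{v}"))
    swap(store, 1)
    swap(store, 2)          # new push becomes the serving version
    swap(store, 1)          # rollback is the same operation on an older version
    print(current_version(store))
```

Because older version directories stay on disk untouched, a rollback costs the same as a swap and involves no data copy.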


5.4 Data Load 


Figure 4 shows the complete data loading and swapping 
process for an individual store. Multiple stores can run 
this entire process concurrently. 

The initiator of this complete construction is a stan- 
dalone driver program that constructs, fetches, and swaps 
the data. This program starts the process by triggering the 
Hadoop job described in Section 5.2. The job generates 
the data on a per-node basis and stores it in HDFS. While 
streaming the data to HDFS, the Hadoop job also calcu- 
lates a checksum on a per-node basis by storing a running 
MD5 on the individual MD5s of all the chunk set files.

Once the Hadoop job is complete, the driver triggers
a fetch request on all the Voldemort nodes. This re- 
quest is received by each node’s “administrative service,” 
which then initiates a parallel fetch from HDFS into its 
respective new version directory. While the data is being 
streamed from HDFS, the checksum is validated with 
the checksum from the build step. Voldemort uses a pull 
model, rather than a push model, as it allows throttling of 
this fetch in case of latency spikes. 
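A per-node checksum built as a running MD5 over per-file MD5s can be sketched in Python as follows. Visiting files in sorted name order is an assumption (both sides must simply agree on some order), and the in-memory dictionary stands in for files streamed to or from HDFS. `node_checksum` is a hypothetical name.

```python
import hashlib


def node_checksum(chunk_files: dict[str, bytes]) -> str:
    """Running MD5 over the MD5 digest of every chunk set file."""
    running = hashlib.md5()
    for name in sorted(chunk_files):
        running.update(hashlib.md5(chunk_files[name]).digest())
    return running.hexdigest()


built = {"0_0_0.index": b"\x00" * 12, "0_0_0.data": b"payload"}
expected = node_checksum(built)      # computed during the build step

fetched = dict(built)                # what a Voldemort node pulled
print(node_checksum(fetched) == expected)

fetched["0_0_0.data"] = b"pay1oad"   # simulate corruption in transit
print(node_checksum(fetched) == expected)
```

Hashing digests rather than raw contents lets each file be checked as it streams, with only a small running state kept per node.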

After the data is available on each node in its new
version directory, the driver triggers a swap operation (de- 
scribed in Section 5.3) on all nodes. On one of LinkedIn’s 
largest clusters, described in Table 3, this complete oper- 
ation takes around 0.012 ms on average with the worst 
swap time of around 0.050 ms. Also, to provide global 
atomic semantics, the driver ensures that all the nodes 
have successfully swapped their data, rolling back the 
successful swaps in case of any other swap failures. 


5.5 Retrieval 


To find a key, the client generates the preference list and 
directs the request to the individual nodes. The following
is a sketch of the algorithm to find data once the request
reaches a particular node.


1. Calculate the MD5 of the key. 

2. Generate the (a) primary partition id, (b) replica id 
(the replica being searched when querying this node), 
and (c) chunk set id (the first 4 bytes of MD5 of the 
key modulo the number of chunk sets per bucket). 

3. Find the corresponding active chunk set files (a data 
file and an index file) using the 3 variables from the 
previous step. 

4. Perform a search using the top 8 bytes of MD5 of 
the key as the search key in the sorted index file. 
Because there are fixed space requirements for every 
key (12 bytes: 8 bytes for key and 4 bytes for offset), 
this search does not require internal pointers within 
the index file. For example, the data location of the
i-th element in the sorted index is simply a jump to
the offset 12·i + 8.

5. If found, read the corresponding data location from 
the index file and jump to the location in the data 
file. Iterate through any potential collided tuples, 
comparing keys, and return the corresponding value 
on key match. 


The most time-consuming step is to search the index 
file. A binary search in an index of 1 million keys can 
result in around 20 key comparisons; if the index file is 
not cached, then 20 disk seeks are required to read one 
value. As a small optimization, while fetching the files 
from HDFS, Voldemort fetches the index files after all 
data files to aid in keeping the index files in the page 
cache. 
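
The figure of roughly 20 comparisons follows from the depth of a binary search over N = 10^6 sorted keys:

```latex
\lceil \log_2 N \rceil = \lceil \log_2 10^{6} \rceil = \lceil 6 \log_2 10 \rceil = \lceil 19.93\ldots \rceil = 20
```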
Rather than binary search, another retrieval strategy for
sorted disk files is interpolation search [17]. This search 
strategy uses the key distribution to predict the approxi- 
mate location of the key, rather than halving the search 
space for every iteration. Interpolation search works well 
for uniformly distributed keys, dropping the search com- 
plexity from O(log N) to O(log log N ). This helps in the 
uncached scenario by reducing the number of disk seeks. 
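
A minimal sketch of interpolation search is shown below. It assumes the keys are held in a sorted Python list of non-negative integers standing in for the 8-byte MD5 prefixes; the function name and the returned probe count are illustrative only.

```python
def interpolation_search(keys, target):
    """Return (position, probes); position is None if target is absent."""
    lo, hi = 0, len(keys) - 1
    probes = 0
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            pos = lo
        else:
            # Estimate the position from the target's place in the key range.
            pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        probes += 1
        if keys[pos] == target:
            return pos, probes
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return None, probes
```

With evenly spaced keys the first estimate lands on the target directly, which is the best case behind the O(log log N) expectation for uniformly distributed keys.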

We also evaluated other strategies like Fast [12] and Pe- 
gasus [8]. As proved in Manolopoulos and Poulakas [13], 
most of these are better suited for non-uniform distribu- 
tions. As MD5 (and its subsets) provides a fairly represen- 
tative uniform distribution, there will be minimal speedup 
from these techniques. 


5.6 Schema Upgrades 


As product features evolve, there are bound to be changes 
to the underlying data model. For example, an admin- 
istrator may want to add a new dimension to a store’s 
value or do a complete non-backwards compatible change 
from storing an array to a map. Because our data is static 
and the system does a full refresh, Voldemort supports 
the ability to change the schema of the key and value 
without downtime. For the client to transparently handle 
this change, the binary JSON serialization format adds a 
special version byte during serialization. The mapping 
of version byte to schema is saved in the store defini- 
tion metadata. The updated store definition metadata can 
be propagated to clients by forcing a rebootstrap. Intro- 
duction of a new schema after a push is now discovered 
by the client during deserialization as it can look up the 
new information after reading the version byte. Similarly, 
during rollback, the client toggles to an older version of 
schema and is able to read the data with no downtime. 
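
The version-byte mechanism can be sketched as follows. This is an illustration only: it uses textual JSON from the Python standard library in place of Voldemort's binary JSON format, and the `SCHEMAS` table is a hypothetical stand-in for the store definition metadata.

```python
import json

# Hypothetical store definition metadata: version byte -> schema description.
SCHEMAS = {
    0: {"type": "array"},  # original layout: value is a list
    1: {"type": "map"},    # later, non-backwards-compatible layout
}


def serialize(value, version: int) -> bytes:
    """Prefix the encoded value with a one-byte schema version."""
    if version not in SCHEMAS:
        raise ValueError(f"unknown schema version {version}")
    return bytes([version]) + json.dumps(value).encode("utf-8")


def deserialize(blob: bytes):
    """Read the version byte, look up its schema, then decode the payload."""
    version = blob[0]
    schema = SCHEMAS[version]  # a miss here is where a client would rebootstrap
    return schema["type"], json.loads(blob[1:].decode("utf-8"))
```

Because each value carries its own version, a client can read old and new data side by side, which is what makes both upgrade and rollback possible without downtime.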


5.7 Rebalancing 


Over time as new stores get added to the cluster, the 
disk to memory ratio increases beyond initial capacity 
planning, resulting in increased read latency. Our data 
being static, the naive approach of starting a new larger 
cluster, repushing the data, and switching clients does 
not scale as it requires massive coordination of multiple 
clients communicating with many stores. 

This creates the need to add capacity to the cluster
transparently and incrementally, independent of data
pushes. The rebalancing feature allows us to add new
nodes to a live cluster without downtime. This feature was 
initially written for read-write stores but easily fits into the 
read-only cycle due to the static nature and fine-grained 
replication of the data. Our smallest unit of rebalancing 
is a partition. In other words, the addition of a new node 
translates to giving the ownership of some partitions to 
that node. The rebalancing process is run by a tool that 
coordinates the full process. 
The following describes the rebalancing strategy during
the addition of a new node. First, the rebalancing tool is 
provided with the future cluster topology metadata, and 
with this data, it generates a list of all primary partitions 
that need to be moved. The tool moves partitions in small 
batches so as to checkpoint and not refetch too much data 
in case of failure. 

For every small batch of primary partitions, the sys- 
tem generates an intermediate cluster topology metadata, 
which is the current cluster topology plus changes in own- 
ership of the batch of partitions moved. Voldemort must 
take care of all secondary replica movements that might 
be required due to the primary partition movement. A 
plan is generated that lists the set of donating and steal- 
ing node-id pairs along with the chunk buckets being 
moved. With this plan, the rebalancing tool starts asyn- 
chronous processes (through the administrative service) 
on the stealer nodes to copy all chunk sets corresponding 
to the moving chunk buckets from their respective donor 
nodes. Rebalancing works only on the active version of 
the data, ignoring the previous versions. During this copy- 
ing, the nodes go into a “rebalancing state” and are not 
allowed to swap any new data. Here it is important to 
note that the granularity of the bucket selected makes this 
process as simple as copying files. If buckets were defined 
on a per-node basis (that is, have multiple chunk sets on a 
per-node basis), the system would have had to iterate over 
all the keys on the node and find the keys belonging to 
the moving partition, finally running an extra merge step 
to coalesce with the live index on the stealer node’s end. 

Once the fetches are complete, the rebalancing tool 
updates the intermediate cluster topology on all the nodes 
while also running the swap operation, described in Sec- 
tion 5.3, for all the stores on the stealer and donor nodes. 
The entire process repeats for every batch of primary 
partitions. 

The intermediate topology change also needs to be 
propagated to all the clients. Voldemort propagates this 
information as a lazy process where the clients still use 
the old metadata. If they contact a node with a request for 
a key in a partition that the node is no longer responsible 
for, a special exception is propagated, which results in a 
rebootstrap step along with a retry of the previous request. 
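
The lazy propagation can be sketched as a client that refreshes its metadata only when a node rejects a request. All class and method names below are hypothetical; `InvalidMetadataError` stands in for the special exception mentioned above.

```python
class InvalidMetadataError(Exception):
    """Hypothetical stand-in for the special exception a node raises."""


class Node:
    def __init__(self, partitions, data):
        self.partitions = set(partitions)
        self.data = data

    def get(self, partition, key):
        if partition not in self.partitions:
            raise InvalidMetadataError(partition)
        return self.data[key]


class Client:
    def __init__(self, fetch_topology):
        self.fetch_topology = fetch_topology  # callable: () -> {partition: Node}
        self.topology = fetch_topology()      # initial bootstrap
        self.rebootstraps = 0

    def get(self, partition, key):
        try:
            return self.topology[partition].get(partition, key)
        except InvalidMetadataError:
            # Stale metadata: refresh it and retry the request once.
            self.topology = self.fetch_topology()
            self.rebootstraps += 1
            return self.topology[partition].get(partition, key)
```

Clients that never touch a moved partition keep working with the old metadata and pay no refresh cost.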

The rebalancing tool has also been designed to handle 
failure scenarios elegantly. Failure during a fetch is not 
a problem as no new data has been swapped. However, 
failure during the topology change and swap phase on 
some nodes requires (a) changing the cluster topology 
to the previous good cluster topology on all nodes and 
(b) rolling back the data on nodes that had successfully 
swapped. 

Table 2a shows the new preference list generation when
a new node is introduced to a cluster with the original
partition mapping as in Figure 3. For simplicity, we show
an imbalanced move of only one partition, partition 3, to
the new node 3. Table 2b shows the plan that would be
generated during rebalancing. The movement of partition
3 results in secondary movement for partition 2 due to
node mapping changes in its preference list.

Table 2: (a) Change of preference list generation after addition of
4th node (node id 3) to the cluster defined by ring in Figure 3. The
highlighted cells show how moving partition 3 to this new node results
in secondary movement of keys hashing to partition 2. (b) Rebalancing
plan generated for addition of a new node.
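
How a primary move causes secondary movement can be sketched by diffing preference lists before and after the topology change. The sketch assumes a preference list is built by walking the ring from the key's partition and collecting the first N distinct nodes. The six-partition ring in the usage note is hypothetical and is not the ring of Figure 3.

```python
def preference_list(owner, partition, n):
    """Walk the ring from `partition`, collecting the first n distinct nodes.

    `owner[p]` is the node that owns partition p.
    """
    nodes = []
    for step in range(len(owner)):
        node = owner[(partition + step) % len(owner)]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes


def rebalance_plan(old_owner, new_owner, n):
    """List (donor, stealer, partition, replica) for every replica whose
    node differs between the old and the new topology."""
    plan = []
    for partition in range(len(old_owner)):
        old = preference_list(old_owner, partition, n)
        new = preference_list(new_owner, partition, n)
        for replica, (donor, stealer) in enumerate(zip(old, new)):
            if donor != stealer:
                plan.append((donor, stealer, partition, replica))
    return plan
```

For example, with owners `[0, 1, 2, 0, 1, 2]` and N=2, giving partition 3 to a new node 3 yields two moves from node 0 to node 3: the primary replica of partition 3 and the secondary replica of partition 2.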


6 Evaluation 


Our evaluation answers the following questions: 


• Can the system rapidly deploy new data sets?

• What is the read latency, and does it scale with data
size and nodes?

• What is the impact on latency during a new data
deployment?

We use a simulated data set where the key is a long 
integer between 0 and a varying number and the value is 
a fixed size 1024 byte random string. All tests were run 
on Linux 2.6.18 machines with Dual CPU (each having 
64-bit 8 cores running at 2.67 GHz), 24 GB of RAM, 6 
disk RAID-10 array and Gigabit Ethernet. We used Com- 
munity Edition version 5.0.27 and the MyISAM storage 
engine for all the MySQL tests. 

As the read-only storage engine relies on the operating 
system’s page cache, we allocated only 4 GB JVM heap. 
Similarly, as MyISAM uses a special key cache for index 
blocks and the page cache for data blocks, we chose the 
same 4 GB for key_buffer_size. 


6.1 Build Times 


One of the important goals of Voldemort is rapid data
deployment, which means the build and push phase must
be fast. Push times are entirely dependent on available
network bandwidth, so we focus on build times.

Figure 7: The time to complete the build for the random data set.
We vary the input data size by increasing the number of tuples. We
terminated the MySQL test early due to prolonged data load time.

We define the build time in the case of Voldemort as
the time starting from the first mapper to the end of all
reducers. The number of mappers and reducers was fixed
across runs to keep the amount of parallelism constant and
generate a fixed number of chunk sets per bucket.

In the case of MySQL, the build time is the comple- 
tion time of the LOAD DATA INFILE command on an 
empty table. This metric ignores the time it took to con- 
vert the data to TSV and copy it to the MySQL node. We 
applied several optimizations to make MySQL faster, in- 
cluding increasing the MySQL bulk insert buffer size and 
the MyISAM specific sort buffer size to 256 MB each, 
and also delaying the re-creation of the index to a later
time by running the ALTER TABLE...DISABLE
KEYS statement before the load.

Figure 7 shows the build time as we increased the size 
of the input data set. As is clearly evident, MySQL 
exhibits extremely slow build times because it buffers 
changes to the index before flushing it to the disk. Also, 
due to the incremental changes required to the index on 
disk, MySQL does roughly 1.4 times more I/O than our 
implementation. This factor would increase if we had 
bulk loaded into a non-empty table. 


6.2 Read Latency 


Besides rapid data deployments, read latency must be 
acceptable and the system must scale with the number of 
nodes. In these experiments, we used 10 million requests
with simulated keys following a uniform distribution
between 0 and the number of tuples in the data set.

We first measure how fast the index loads into the operating
system’s page cache. We ran tests on a 100 GB data
set on a single node and reported the median latency after
swap for a continuous stream of uniformly-distributed
requests. For MySQL, we created a view on an existing
table, bulk loaded into a new table, and swapped the
view to the new table without stopping the requests. For
our read-only storage engine, we used the complete data
load process (described in Section 5.4) to swap new data.
The single node was configured to have just one partition
and one chunk set per bucket. We also compared the binary
and interpolation search algorithms for the read-only
storage engine.

Figure 8: Single node median read latency taken at 1 minute intervals
since the swap. The distribution of requests is uniform. The slope of the
graph shows the rate of cache warming.

Figure 9: Single node read latency after warming up the cache. This
figure shows the change in latency, for uniformly-distributed requests,
as we vary the client throughput.


Figure 8 shows the median latency, at 1 minute intervals,
starting from the swap. MySQL starts with a very
high median latency due to the uncached index and falls
slowly to the stable 1.4 ms mark. Our storage engine
starts with low latency because some indexes are already
page cache resident, with the fetch phase from HDFS
retrieving all index files after the data files. Binary search
initially starts with a high median latency compared to
interpolation, but the slope of the line is steeper, because
binary search does an average of 8 lookups, thereby touching
more parts of the index; interpolation search performs
an average of only 1 lookup. While this results in an
initial low read latency, it means that much of the index
is uncached in the long run. Our production systems are
currently running binary search due to this faster cache
warming process. All numbers presented from this point
for the read-only storage engine use binary search.

Figure 10: Client-side median latency with varying data size. This test
was run on a 32 node cluster with 2 different request distributions.


Figure 9 compares Voldemort’s performance with that of
MySQL on the same 100 GB data set for varying
throughput. The numbers reported are steady-state
latencies; that is, latency reported after the cache
is warmed. For comparison, the steady state latency for
the read-only storage engine in Figure 9 is around 0.3 ms 
and is achieved around 90 minutes after the swap. We 
observed that the time to achieve this steady state, starting 
from the swap time, is linear in the size of the data set. We 
increased the client request throughput until the achieved 
throughput stopped increasing. These results indicate that 
our implementation scales to roughly twice the number 
of queries per second while maintaining the same median 
latency as MySQL. 


To test whether our read-only extensions scale with the 
number of nodes, we evaluated read latency for the same 
random data set but spread over 32 machines and a store 
with N=1. The read tests were run for both uniform as 
well as a Zipfian distribution using YCSB [5], an open 
source framework for benchmarking cloud data serving 
systems, with the number of clients fixed at 100 threads. 
The Zipfian distribution ensures that some keys are more 
frequently accessed compared to others, simulating the 
general site visiting patterns of most websites [1]. Fig- 
ure 10 shows the overall client-side median latency while 
varying the data set sizes. Querying for frequently ac- 
cessed keys naturally aids caching certain sections of the 
indexes, thereby exhibiting an overall lower latency for 
Zipfian compared to the uniform distribution. We do not
report numbers for a store with N>1 because latency is a
function of data size and is independent of the replication
factor. The results indicate that the system scales with the
data set size and the number of nodes. As the data set size
increases, we are decreasing the memory to data ratio,
affecting read performance. Reducing latency in this case
would require adding memory or additional nodes. Users
can tune this ratio to achieve the desired latency versus
the necessary hardware footprint.

Table 3: Statistics for one of LinkedIn’s read-only clusters.
Rows: number of nodes; total (active + backup) data size per node;
RAM per node; current active data to memory ratio; total number of
stores; replication factor for all stores; largest store size (active);
smallest store size (active); max number of store swaps per day.


6.3 Production Workloads


Finally, we show the production performance data for two 
user-facing features: “People You May Know” (Figure 1a) 
and collaborative filtering (Figure 1b): 


• People You May Know (PYMK) data set: Users are
presented with a suggested set of other users they
might know and would like to connect with. This
information is kept as a store where the key is the
user’s id and the value is a list of integer recommended
user ids and a float score.

• Collaborative filtering (CF) data set: This feature
shows other profiles viewed in the same session as
the visited member’s profile. The value is a list of
two integer ids, a string indicating the entity type,
and a float score.

Table 3 shows some statistics for one of LinkedIn’s 
largest clusters. Figure 11 shows the PYMK and CF 
median client-side read latencies as a function of time 
since a swap on this cluster (both stores use N=2 and 
R=1) for one high traffic day. CF has a higher latency 
than that of PYMK primarily because of the larger value 
size. We see sub-12 ms latency immediately after a swap 
with relatively quick stabilization to sub-5 ms latency. 
This low latency post-swap allows us to push updates to 
these features multiple times per day. 


7 Conclusion and Future Work 


In this paper, we present a low-latency bulk loading system
capable of serving multiple terabytes of data. By
moving the index construction offline to a batch system
like Hadoop, our serving layer’s performance becomes
more stable and reliable. LinkedIn has been successfully
running read-only Voldemort clusters for the past 2 years.
It has become an integral part of the product ecosystem
with various engineers also using it frequently for quick
prototyping of features. The complete system is open-source
and freely available.

Figure 11: Median client-side read latency for one of LinkedIn’s largest
production clusters, as described in Table 3, for the (a) PYMK and (b)
CF data sets. The dashed line shows the time when the new data set was
swapped in.

We plan to add other interesting features to the read- 
only storage pipeline. Over time we have found that 
during fetches we exhaust the full bandwidth between 
data centers running Hadoop (in particular HDFS) and 
Voldemort. We therefore need improvements to the push 
process to reduce network usage with minimal impact on 
build time. 

To start with, we are exploring incremental loads. This 
can be done by generating data file patches on Hadoop by 
comparing against the previous data snapshot in HDFS 
and then applying these on the Voldemort side during the 
fetch phase. We can send the complete index files because 
(a) they are relatively small files and (b) we can exploit 
the operating system caching of these files during the 
fetch phase. This capability has seen few use cases until 
recently as most of our stores back recommendation features
where the delta between data pushes is prohibitively
large. Another important feature to save inter-data cen- 
ter bandwidth is the ability to only fetch one replica of 
the data from HDFS and then propagate it among the 
Voldemort nodes. 

Finally, we are investigating additional index structures 
that could improve lookup speed and that can easily be 
built in Hadoop. In particular, cache-oblivious trees, such 
as van Emde Boas trees [23], require no page size knowl- 
edge for optimal cache performance. 
Abstract 


We present BlueSky, a network file system backed by 
cloud storage. BlueSky stores data persistently in a cloud 
storage provider such as Amazon S3 or Windows Azure, 
allowing users to take advantage of the reliability and 
large storage capacity of cloud providers and avoid the 
need for dedicated server hardware. Clients access the 
storage through a proxy running on-site, which caches 
data to provide lower-latency responses and additional 
opportunities for optimization. We describe some of the 
optimizations which are necessary to achieve good per- 
formance and low cost, including a log-structured design 
and a secure in-cloud log cleaner. BlueSky supports mul- 
tiple protocols—both NFS and CIFS—and is portable to 
different providers. 


1 Introduction 


The promise of third-party “cloud computing” services is
a trifecta of reduced cost, dynamic scalability, and high 
availability. While there remains debate about the precise 
nature and limit of these properties, it is difficult to deny 
that cloud services offer real utility—evident in the large 
numbers of production systems now being cloud-hosted 
via services such as Amazon’s AWS and Microsoft’s 
Azure. However, thus far, services hosted in the cloud 
have largely fallen into two categories: consumer-facing 
Web applications (e.g., Netflix customer Web site and 
streaming control) and large-scale data crunching (e.g., 
Netflix media encoding pipeline). 

Little of this activity, however, has driven widespread 
outsourcing of enterprise computing and storage applica- 
tions. The reasons for this are many and varied, but they 
largely reflect the substantial inertia of existing client- 
server deployments. Enterprises have large capital and 
operational investments in client software and depend on 
the familiar performance, availability and security char- 
acteristics of traditional server platforms. In essence, 
cloud computing is not currently a transparent “drop in” 
replacement for existing services. 


*Current affiliation: Google. The work in this paper was performed 
while a student at UC San Diego. 


There are also substantive technical challenges to 
overcome, as the design points for traditional client- 
server applications (e.g., file systems, databases, etc.) 
frequently do not mesh well with the services offered 
by cloud providers. In particular, many such applica- 
tions are designed to be bandwidth-hungry and latency- 
sensitive (a reasonable design in a LAN environment), 
while the remote nature of cloud service naturally in- 
creases latency and the cost of bandwidth. Moreover, 
while cloud services typically export simple interfaces 
to abstract resources (e.g., “put file” for Amazon’s S3),
traditional server protocols can encapsulate significantly 
more functionality. Thus, until such applications are re- 
designed, much of the latent potential for outsourcing 
computing and storage services remains untapped. In- 
deed, at $115B/year, small and medium business (SMB) 
expenditures for servers and storage represent an enor- 
mous market should these issues be resolved [9]. Even 
if the eventual evolution is towards hosting all applica- 
tions in the cloud, it will be many years before such a 
migration is complete. In the meantime, organizations 
will need to support a mix of local applications and use 
of the cloud. 


In this paper, we explore an approach for bridging 
these domains for one particular application: network file 
service. In particular, we are concerned with the extent 
to which traditional network file service can be replaced 
with commodity cloud services. However, our design 
is purposely constrained by the tremendous investment 
(both in capital and training) in established file system 
client software; we take as a given that end-system soft- 
ware will be unchanged. Consequently, we focus on a 
proxy-based solution, one in which a dedicated proxy 
server provides the illusion of a single traditional file 
server in an enterprise setting, translating requests into 
appropriate cloud storage API calls over the Internet. 


We explore this approach through a prototype sys- 
tem, called BlueSky, that supports both NFS and CIFS 
network file system protocols and includes drivers for 
both the Amazon EC2/S3 environment and Microsoft’s 
Azure. The engineering of such a system faces a number
of design challenges, the most obvious of which revolve
around performance (i.e., caching, hiding latency, and
maximizing the use of Internet bandwidth), but less
intuitively also interact strongly with cost. In particular, the
interaction between the storage interfaces and fee schedule
provided by current cloud service providers conspires
to favor large segment-based layout designs (as well
as cloud-based file system cleaners). We demonstrate
that ignoring these issues can dramatically inflate costs 
(as much as 30x in our benchmarks) without signifi- 
cantly improving performance. Finally, across a series 
of benchmarks we demonstrate that, when using such a 
design, commodity cloud-based storage services can pro- 
vide performance competitive with local file servers for 
the capacity and working sets demanded by enterprise 
workloads, while still accruing the scalability and cost 
benefits offered by third-party cloud services. 


2 Related Work 


Network storage systems have engendered a vast litera- 
ture, much of it focused on the design and performance 
of traditional client server systems such as NFS, AFS,
CIFS, and WAFL [6, 7, 8, 25]. Recently, a range of 
efforts has considered other structures, including those 
based on peer-to-peer storage [16] among distributed sets 
of untrusted servers [12, 13] which have indirectly in- 
formed subsequent cloud-based designs. 

Cloud storage is a newer topic, driven by the availabil- 
ity of commodity services from Amazon’s S3 and other 
providers. The elastic nature of cloud storage is reminis- 
cent of the motivation for the Plan 9 write-once file sys- 
tems [19, 20], although cloud communication overheads 
and monetary costs argue against a block interface and 
no storage reclamation. Perhaps the closest academic 
work to our own is SafeStore [11], which stripes erasure- 
coded data objects across multiple storage providers, ul- 
timately exploring access via an NFS interface. How- 
ever, SafeStore is focused clearly on availability, rather 
than performance or cost, and thus its design decisions 
are quite different. A similar, albeit more complex sys- 
tem, is DepSky [2], which also focuses strongly on avail- 
ability, proposing a “cloud of clouds” model to replicate 
across providers. 

At a more abstract level, Chen and Sion create an 
economic framework for evaluating cloud storage costs 
and conclude that the computational costs of the cryp- 
tographic operations needed to ensure privacy can over- 
whelm other economic benefits [3]. However, this work 
predates Intel’s AES-NI architecture extension which 
significantly accelerates data encryption operations. 

There have also been a range of non-academic attempts
to provide traditional file system interfaces for the
key-value storage systems offered by services like
Amazon’s S3. Most of these install new per-client file system
drivers. Exemplars include s3fs [22], which tries to map
the file system directly on to S3’s storage model (which
both changes file system semantics and can dramatically
increase costs) and ElasticDrive [5], which exports
a block-level interface (potentially discarding optimizations
that use file-level knowledge such as prefetching).

However, the systems closest to our own are “cloud
storage gateways”, a new class of storage server that has
emerged in the last few years (contemporaneous with our
effort). These systems, exemplified by companies such
as Nasuni, Cirtas, TwinStrata, StorSimple and Panzura,
provide caching network file system proxies (or “gateways”)
that are, at least on the surface, very similar to
our design. Pricing schedules for these systems generally
reflect a 2× premium over raw cloud storage costs.
While few details of these systems are public, in general 
they validate the design point we have chosen. 

Of commercial cloud storage gateways, Nasuni [17] 
is perhaps most similar to BlueSky. Nasuni provides a 
“virtual NAS appliance” (or “filer”), software packaged
as a virtual machine which the customer runs on their 
own hardware—this is very much like the BlueSky proxy 
software that we build. The Nasuni filer acts as a cache 
and writes data durably to the cloud. Because Nasuni 
does not publish implementation details it is not possi- 
ble to know precisely how similar Nasuni is to BlueSky, 
though there are some external differences. In terms of 
cost, Nasuni charges a price based simply on the quantity 
of disk space consumed (around $0.30/GB/month, de- 
pending on the cloud provider)—and not at all a function 
of data transferred or operations performed. Presumably, 
Nasuni optimizes their system to reduce the network and 
per-operation overheads—otherwise those would eat into 
their profits—but the details of how they do so are un- 
clear, other than by employing caching. 

Cirtas [4] builds a cloud gateway as well but sells it 
in appliance form: Cirtas’s Bluejet is a rack-mounted 
computer which integrates software to cache file system 
data with storage hardware in a single package. Cirtas 
thus has a higher up-front cost than Nasuni’s product, 
but is easier to deploy. Panzura [18] provides yet another 
CIFS/NFS gateway to cloud storage. Unlike BlueSky 
and the others, Panzura allows multiple customer sites 
to each run a cloud gateway. Each of these gateways
accesses the same underlying file system, so Panzura is
particularly appropriate for teams sharing data over a wide
area. But again, implementation details are not provided.

TwinStrata [29] and StorSimple [28] implement gate- 
ways that present a block-level storage interface, like 
ElasticDrive, and thus lose many potential file system- 
level optimizations as well. 

In some respects BlueSky acts like a local storage 
server that backs up data to the cloud—a local NFS 
server combined with Mozy [15], Cumulus [30], or similar
software could provide similar functionality. However,
such backup tools may not support a high backup
frequency (ensuring data reaches the cloud quickly) and 
efficient random access to files in the cloud. Further, they 
treat the local data (rather than the cloud copy) as au- 
thoritative, preventing the local server from caching just 
a subset of the files. 


3 Architecture 


BlueSky provides service to clients in an enterprise us- 
ing a transparent proxy-based architecture that stores 
data persistently on cloud storage providers (Figure 1). 
The enterprise setting we specifically consider consists 
of a single proxy cache colocated with enterprise clients, 
with a relatively high-latency yet high-bandwidth link to 
cloud storage, with typical office and engineering request 
workloads to files totaling tens of terabytes. This sec- 
tion discusses the role of the proxy and cloud provider 
components, as well as the security model supported by 
BlueSky. Sections 4 and 5 then describe the layout and 
operation of the BlueSky file system and the BlueSky 
proxy, respectively. 

Cloud storage acts much like another layer in the stor- 
age hierarchy. However, it presents new design consid- 
erations that, combined, make it distinct from other lay- 
ers and strongly influence its use as a file service. The 
high latency to the cloud necessitates aggressive caching 
close to the enterprise. On the other hand, cloud storage 
has elastic capacity and provides operation service times 
independent of spatial locality, thus greatly easing free 
space management and data layout. Cloud storage inter- 
faces often only support writing complete objects in an 
operation, preventing the efficient update of just a portion 
of a stored object. This constraint motivates an append 
rather than an overwrite model for storing data. 

Monetary cost also becomes an explicit metric of 
optimization: cloud storage capacity might be elastic, 
but still needs to be parsimoniously managed to min- 
imize storage costs over time [30]. With an append 
model of storage, garbage collection becomes a neces- 
sity. Providers also charge a small cost for each opera- 
tion. Although slight, costs are sufficiently high to moti- 
vate aggregating small objects (metadata and small files) 
into larger units when writing data. Finally, outsourcing 
data storage makes security a primary consideration. 
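
The effect of aggregation on the number of billed operations can be sketched as follows. The greedy packing policy, the function name, and the 4 MiB segment size in the usage note are assumptions for illustration; they are not taken from BlueSky.

```python
def pack_into_segments(sizes, segment_bytes):
    """Greedily group object sizes, in arrival order, into segments.

    A segment is closed once adding the next object would exceed
    `segment_bytes`; an oversized object gets a segment to itself.
    """
    segments, current, used = [], [], 0
    for size in sizes:
        if current and used + size > segment_bytes:
            segments.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        segments.append(current)
    return segments
```

Written individually, 1000 objects of 4 KiB each would cost 1000 put operations; packed into a hypothetical 4 MiB segment they cost a single one.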


3.1 Local Proxy 


The central component of BlueSky is a proxy situated 
between clients and cloud providers. The proxy commu- 
nicates with clients in an enterprise using a standard net- 
work file system protocol, and communicates with cloud 
providers using a cloud storage protocol. Our prototype 
supports both the NFS (version 3) and CIFS protocols for 
clients, and the RESTful protocols for the Amazon S3 
and Windows Azure cloud services. Ideally, the proxy
runs in the same enterprise network as the clients to
minimize latency to them. The proxy caches data locally and
manages sharing of data among clients without requiring
an expensive round-trip to the cloud.

Figure 1: BlueSky architecture.

Clients do not require modification since they continue 
to use standard file-sharing protocols. They mount Blue- 
Sky file systems exported by the proxy just as if they 
were exported from an NFS or CIFS server. Further, the 
same BlueSky file system can be mounted by any type of 
client with shared semantics equivalent to Samba. 

As described in more detail later, BlueSky lowers cost 
and improves performance by adopting a log-structured 
data layout for the file system stored on the cloud 
provider. A cleaner reclaims storage space by garbage- 
collecting old log segments which do not contain any live 
objects, and processing almost-empty segments by copy- 
ing live data out of old segments into new segments. 
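
One cleaning pass of the kind described above can be sketched
in a few lines of Python. This is a minimal illustration, not the
BlueSky cleaner: the function name clean, the representation of
a segment as a list of (uid, size) pairs, and the 50% utilization
threshold used to decide that a segment is "almost empty" are
all assumptions made for the example.

```python
def clean(segments, live_uids, utilization_threshold=0.5):
    """Run one cleaning pass and return (kept_segments, survivors).

    segments maps a segment name to a list of (uid, size) objects.
    survivors is the live data that must be rewritten into a new segment.
    """
    kept, survivors = {}, []
    for name, objects in segments.items():
        live = [(uid, size) for uid, size in objects if uid in live_uids]
        total_bytes = sum(size for _, size in objects)
        live_bytes = sum(size for _, size in live)
        if not live:
            continue                  # no live objects: the segment is deleted
        if live_bytes < utilization_threshold * total_bytes:
            survivors.extend(live)    # almost empty: copy live data out
        else:
            kept[name] = objects      # mostly live: leave the segment alone
    return kept, survivors


segments = {
    "s1": [("a", 10), ("b", 10)],     # fully dead
    "s2": [("c", 10), ("d", 30)],     # 25% live
    "s3": [("e", 10), ("f", 10)],     # fully live
}
kept, survivors = clean(segments, live_uids={"c", "e", "f"})
```

Here s1 is dropped outright, the single live object in s2 is
queued for rewriting, and s3 is left in place.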

As a write-back cache, the BlueSky proxy can fully 
satisfy client write requests with local network file sys- 
tem performance by writing to its local disk—as long as 
its cache capacity can absorb periods of write bursts as 
constrained by the bandwidth the proxy has to the cloud 
provider (Section 6.5). For read requests, the proxy can 
provide local performance to the extent that the proxy 
can cache the working set of the client read workload 
(Section 6.4). 


3.2 Cloud Provider 


So that BlueSky can potentially use any cloud provider 
for persistent storage service, it makes minimal assump- 
tions of the provider; in our experiments, we use both 
Amazon S3 and the Windows Azure blob service. Blue- 
Sky requires only a basic interface supporting get, put, 
list, and delete operations. If the provider also sup- 
ports a hosting service, BlueSky can co-locate the file 
system cleaner at the provider to reduce cost and improve 
cleaning performance. 
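
The four operations above amount to a very small interface. The
following Python sketch is illustrative only: the class name
InMemoryStore and its method signatures are assumptions, not
BlueSky's actual backend API, and a dictionary stands in for the
cloud provider.

```python
class InMemoryStore:
    """Hypothetical stand-in for a provider offering get, put, list, delete."""

    def __init__(self):
        self._objects = {}  # key -> bytes

    def put(self, key, data):
        # Whole-object write: any previous value for the key is replaced.
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        # Return all keys that start with the prefix, in sorted order.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete(self, key):
        self._objects.pop(key, None)


provider = InMemoryStore()
provider.put("proxy/00000001", b"segment one")
provider.put("cleaner/00000001", b"cleaned segment")
proxy_keys = provider.list("proxy/")
```

Anything a provider offers beyond these calls is optional from
the file system's point of view.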


3.3 Security


Security becomes a key concern with outsourcing critical 
functionality such as data storage. In designing BlueSky, 
our goal is to provide high assurances of data confidentiality
and integrity. The proxy encrypts all client data
before sending it over the network, so the provider can- 
not read private data. Encryption is at the level of objects 
(inodes, file blocks, etc.) and not entire log segments. 
Data stored at the provider also includes integrity checks 
to detect any tampering by the storage provider. 

However, some trust in the cloud provider is unavoid- 
able, particularly for data availability. The provider can 
always delete or corrupt stored data, rendering it unavail- 
able. These actions could be intentional—e.g., if the 
provider is malicious—or accidental, for instance due 
to insufficient redundancy in the face of correlated hard- 
ware failures from disasters. Ultimately, the best guard 
against such problems is through auditing and the use of 
multiple independent providers [2, 11]. BlueSky could 
readily incorporate such functionality, but doing so re- 
mains outside the scope of our current work. 

A buggy or malicious storage provider could also 
serve stale data. Instead of returning the most recent data,
it could return an old copy of a data object that nonethe- 
less has a valid signature (because it was written by the 
client at an earlier time). By authenticating pointers be- 
tween objects starting at the root, however, BlueSky pre- 
vents a provider from selectively rolling back file data. 
A provider can only roll back the entire file system to an 
earlier state, which customers will likely detect. 

BlueSky can also take advantage of computation in 
the cloud for running the file system cleaner. As with 
storage, we do not want to completely trust the compu- 
tational service, yet doing so provides a tension in the 
design. To maintain confidentiality, data encryption keys 
should not be available on cloud compute nodes. Yet, 
if cloud compute nodes are used for file system mainte- 
nance tasks, the compute nodes must be able to read and 
manipulate file system data structures. For BlueSky, we 
make the tradeoff of encrypting file data while leaving 
the metadata necessary for cleaning the file system un- 
encrypted. As a result, storage providers can understand 
the layout of the file system, but the data remains confi- 
dential and the proxy can still validate its integrity. 

In summary, BlueSky provides strong confidentiality 
and slightly weaker integrity guarantees (some data roll- 
back attacks might be possible but are largely prevented), 
but must rely on the provider for availability. 


4 BlueSky File System 


This section describes the BlueSky file system layout. 
We present the object data structures maintained in the 
file system and their organization in a log-structured for- 
mat. We also describe how BlueSky cleans the logs com- 
prising the file system, and how the design conveniently 
lends itself to providing versioned backups of the data 
stored in the file system.


4.1 Object Types


BlueSky uses four types of objects for representing data 
and metadata in its log-structured file system [23] for- 
mat: data blocks, inodes, inode maps, and checkpoints. 
These objects are aggregated into log segments for stor- 
age. Figure 2 illustrates their relationship in the layout of 
the file system. On top of this physical layout BlueSky 
provides standard POSIX file system semantics, includ- 
ing atomic renames and hard links. 

Data blocks store file data. Files are broken apart into 
fixed-size blocks (except the last block may be short). 
BlueSky uses 32 KB blocks instead of typical disk file 
system sizes like 4 KB to reduce overhead: block point- 
ers as well as extra header information impose a higher 
per-block overhead in BlueSky than in an on-disk file 
system. In the evaluations in Section 6, we show the 
cost and performance tradeoffs of this decision. Noth- 
ing fundamental, however, prevents BlueSky from using 
variable-size blocks optimized for the access patterns of 
each file, but we have not implemented this approach. 
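
As a small illustration of the fixed-size block layout, the
sketch below splits file contents into 32 KB blocks. The helper
name split_into_blocks is hypothetical; only the 32 KB and 4 KB
sizes come from the text.

```python
BLOCK_SIZE = 32 * 1024  # 32 KB, the block size stated in the text


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split file contents into fixed-size blocks; only the last may be short."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


contents = b"x" * (70 * 1024)                        # a 70 KB file
blocks = split_into_blocks(contents)
sizes = [len(b) for b in blocks]
pointers_4k = len(split_into_blocks(contents, 4 * 1024))
```

The 70 KB file needs three block pointers at 32 KB but eighteen
at 4 KB, which is the per-block overhead the larger block size
is meant to reduce.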

Inodes for all file types include basic metadata: own- 
ership and access control, timestamps, etc. For regu- 
lar files, inodes include a list of pointers to data blocks 
with the file contents. Directory entries are stored inline 
within the directory inode to reduce the overhead of path 
traversals. BlueSky does not use indirect blocks for lo- 
cating file data—inodes directly contain pointers to all 
data blocks (easy to do since inodes are not fixed-size). 

Inode maps list the locations in the log of the most 
recent version of each inode. Since inodes are not stored 
at fixed locations, inode maps provide the necessary level 
of indirection for locating inodes. 

A checkpoint object determines the root of a file sys- 
tem snapshot. A checkpoint contains pointers to the loca- 
tions of the current inode map objects. On initialization 
the proxy locates the most recent checkpoint by scan- 
ning backwards in the log, since the checkpoint is always 
one of the last objects written. Checkpoints are useful 
for maintaining file system integrity in the face of proxy 
failures, for decoupling cleaning and file service, and for 
providing versioned backup. 
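
The lookup path from checkpoint to inode map to inode can be
sketched as follows. The log is modeled as a Python list whose
indices stand in for log locations, and the dictionary field
names (type, inode_maps, inodes) are assumptions made for this
example rather than BlueSky's on-disk format.

```python
def find_latest_checkpoint(log):
    """Scan the log (oldest entry first) backwards for the newest checkpoint."""
    for obj in reversed(log):
        if obj["type"] == "checkpoint":
            return obj
    return None


def locate_inode(log, inum):
    """Follow checkpoint -> inode map -> inode for the given inode number."""
    checkpoint = find_latest_checkpoint(log)
    if checkpoint is None:
        return None
    for map_loc in checkpoint["inode_maps"]:
        inode_map = log[map_loc]
        if inum in inode_map["inodes"]:
            return log[inode_map["inodes"][inum]]
    return None


log = [
    {"type": "inode", "inum": 7, "size": 10},       # location 0 (stale)
    {"type": "inode_map", "inodes": {7: 0}},        # location 1
    {"type": "checkpoint", "inode_maps": [1]},      # location 2
    {"type": "inode", "inum": 7, "size": 42},       # location 3 (current)
    {"type": "inode_map", "inodes": {7: 3}},        # location 4
    {"type": "checkpoint", "inode_maps": [4]},      # location 5
]
current = locate_inode(log, 7)
older = locate_inode(log[:3], 7)   # the view from the earlier checkpoint
```

Truncating the log at an earlier checkpoint yields the older
inode, which is the property that versioned backup relies on.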


4.2 Cloud Log 


For each file system, BlueSky maintains a separate log 
for each writer to the file system. Typically there are 
two: the proxy managing the file system on behalf of 
clients and a cleaner that garbage collects overwritten 
data. Each writer stores its log segments to a separate 
directory (different key prefix), so writers can make up- 
dates to the file system independently. 

Each log consists of a number of log segments, and 
each log segment aggregates multiple objects together 
into an approximately fixed-size container for storage 
and transfer. In the current implementation segments are
up to about 4 MB, large enough to avoid the overhead
of dealing with many small objects. Though the storage
interface requires that each log segment be written in a
single operation, typically cloud providers allow partial
reads of objects. As a result, BlueSky can read individual
objects regardless of segment size. Section 6.6 quantifies
the performance benefits of grouping data into segments
and of selective reads, and Section 6.7 quantifies their
cost benefits.

Figure 2: BlueSky filesystem layout. The top portion shows the logical organization. Object pointers are shown with
solid arrows. Shaded objects are encrypted (but pointers are always unencrypted). The bottom of the figure illustrates
how these log items are packed into segments stored in the cloud.

A monotonically-increasing sequence number identi- 
fies each log segment within a directory, and a byte offset 
identifies a specific object in the segment. Together, the 
triple (directory, sequence number, offset) describes the 
physical location of each object. Object pointers also include
the size of the object; while not required, this hint
allows BlueSky to quickly issue a read request for the
exact bytes needed to fetch the object. 
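
To make the role of the size hint concrete, the sketch below
turns an object pointer into the key of its segment and an HTTP
Range header value covering exactly the object's bytes. The
field names and the key naming scheme are hypothetical; the text
specifies only the (directory, sequence number, offset) triple
plus the size.

```python
from collections import namedtuple

# Field names are illustrative, not taken from the BlueSky source.
ObjectPointer = namedtuple("ObjectPointer", "directory sequence offset size")


def segment_key(ptr):
    """Build a storage key for the segment holding the object (assumed naming)."""
    return "%s/%08d" % (ptr.directory, ptr.sequence)


def range_header(ptr):
    """Return an HTTP Range header value selecting exactly the object's bytes."""
    # HTTP byte ranges are inclusive at both ends.
    return "bytes=%d-%d" % (ptr.offset, ptr.offset + ptr.size - 1)


ptr = ObjectPointer(directory="proxy", sequence=12, offset=4096, size=32768)
key = segment_key(ptr)
header = range_header(ptr)
```

Without the size, a reader would have to fetch the object's
header first or over-read, costing an extra request or extra
bytes.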

In support of BlueSky’s security goals (Section 3.3), 
file system objects are individually encrypted (with AES) 
and protected with a keyed message authentication code 
(HMAC-SHA-256) by the proxy before uploading to the 
cloud service. Each object contains data with a mix of 
protections: some data is encrypted and authenticated, 
some data is authenticated plain-text, and some data is 
unauthenticated. The keys for encryption and authenti- 
cation are not shared with the cloud, though we assume 
that customers keep a safe backup of these keys for dis- 
aster recovery. Figure 3 summarizes the fields included 
in objects. 
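
The authentication half of this scheme can be sketched with
Python's standard hmac and hashlib modules. The example below
is an assumption-laden simplification: it covers only the
HMAC-SHA-256 tag, omits AES encryption (which is not in the
Python standard library), and the seal/open_sealed names and
the layout of tag-after-payload are hypothetical rather than
BlueSky's object format.

```python
import hashlib
import hmac

TAG_LEN = 32  # SHA-256 produces 32-byte tags


def seal(auth_key, payload):
    """Append an HMAC-SHA-256 tag to the payload."""
    tag = hmac.new(auth_key, payload, hashlib.sha256).digest()
    return payload + tag


def open_sealed(auth_key, blob):
    """Return the payload if its tag verifies, otherwise raise ValueError."""
    payload, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(auth_key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return payload


auth_key = b"k" * 32
blob = seal(auth_key, b"inode contents")
tampered = bytes([blob[0] ^ 1]) + blob[1:]   # flip one bit of the payload
```

Because the key never leaves the proxy, a provider that alters
even one bit cannot produce a matching tag.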

BlueSky generates a unique identifier (UID) for each
object when the object is written into the log. The UID
remains constant if an item is simply relocated to a new
log position. An object can contain pointers to other
objects (for example, an inode pointing to data blocks),
and the pointer lists both the UID and the physical location.
A cleaner in the cloud can relocate objects and
update pointers with the new locations; as long as the
UID in the pointer and the object match, the proxy can
validate that the data has not been tampered with.

Figure 3: Data fields included in most objects.
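
The UID comparison can be illustrated with a toy object store.
This sketch shows only the matching step; it leaves out the
cryptographic checks on the object itself, and the dictionary
layout and the validate_pointer name are assumptions made for
the example.

```python
def validate_pointer(pointer, fetch):
    """Fetch the object at pointer['loc'] and check that its UID matches."""
    obj = fetch(pointer["loc"])
    if obj["uid"] != pointer["uid"]:
        raise ValueError("UID mismatch: not the object this pointer names")
    return obj


# Toy storage keyed by (directory, sequence number, offset).
objects = {("proxy", 1, 0): {"uid": "block-a", "data": b"hello"}}
pointer = {"uid": "block-a", "loc": ("proxy", 1, 0)}

# The cleaner moves the object and rewrites only the location in the pointer.
objects[("cleaner", 1, 0)] = objects.pop(("proxy", 1, 0))
relocated = dict(pointer, loc=("cleaner", 1, 0))
obj = validate_pointer(relocated, objects.__getitem__)
```

A cleaner that pointed the location at some other object would
fail the check, since it cannot change the UID recorded in the
pointer's authenticated parent.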


4.3 Cleaner


As with any log-structured file system, BlueSky requires 
a file system cleaner to garbage collect data that has been 
overwritten. Unlike traditional disk-based systems, the 
elastic nature of cloud storage means that the file sys- 
tem can grow effectively unbounded. Thus, the cleaner 
is not necessary to make progress when writing out new 
data, only to reduce storage costs and defragment data 
for more efficient access. 

We designed the BlueSky cleaner so that it can run 
either at the proxy or on a compute instance within the 
cloud provider where it has faster, cheaper access to the 
storage. For example, when running the cleaner in Ama- 
zon EC2 and accessing storage in S3, Amazon does not 
charge for data transfers (though it still charges for op- 
erations). A cleaner running in the cloud does not need 
to be fully trusted—it will need permission to read and 
write cloud storage, but does not require the file system 
encryption and authentication keys.

The cleaner runs online with no synchronous interactions
with the proxy: clients can continue to access and
modify the file system even while the cleaner is running. 
Conflicting updates to the same objects are later merged 
by the proxy, as described in Section 5.3. 


4.4 Backups 


The log-structured design allows BlueSky to integrate 
file system snapshots for backup purposes easily. In fact, 
so long as a cleaner is never run, any checkpoint record 
ever written to the cloud can be used to reconstruct the 
state of the file system at that point in time. Though not 
implemented in our prototype, the cleaner or a snapshot 
tool could record a list of checkpoints to retain and pro- 
tect all required log segments from deletion. Those seg- 
ments could also be archived elsewhere for safekeeping. 


4.5 Multi-Proxy Access


In the current BlueSky implementation only a single 
proxy can write to the file system, along with the cleaner 
which can run in parallel. It would be desirable to have 
multiple proxies reading from and writing to the same 
BlueSky file system at the same time—either from a sin- 
gle site, to increase capacity and throughput, or from 
multiple sites, to optimize latency for geographically- 
distributed clients. 

The support for multiple file system logs in BlueSky 
should make it easier to add support for multiple concur- 
rent proxies. Two approaches are possible. Similar to 
Ivy [16], the proxies could be unsynchronized, offering 
loose consistency guarantees and assuming only a single 
site updates a file most of the time. When conflicting 
updates occur in the uncommon case, the system would 
present the user with multiple file versions to reconcile. 

A second approach is to provide stronger consistency 
by serializing concurrent access to files from multiple 
proxies. This approach adds the complexity of some 
type of distributed lock manager to the system. Since 
cloud storage itself does not provide the necessary lock- 
ing semantics, a lock manager would either need to run 
on a cloud compute node or on the proxies (ideally, dis- 
tributed across the proxies for fault tolerance). 

Exploring either option remains future work. 


5 BlueSky Proxy 


This section describes the design and implementation of 
the BlueSky proxy, including how it caches data in mem- 
ory and on disk, manages its network connections to the 
cloud, and indirectly cooperates with the cleaner. 


5.1 Cache Management 


The proxy uses its local disk storage to implement a 
write-back cache. The proxy logs file system write requests
from clients (both data and metadata) to a journal
on local disk, and ensures that data is safely on disk before
telling clients that data is committed. Writes are
sent to the cloud asynchronously. Physically, the journal 
is broken apart into sequentially-numbered files on disk 
(journal segments) of a few megabytes each. 

This write-back caching does mean that in the event of
a catastrophic failure of the proxy (that is, if the proxy’s
storage is lost), some data may not yet have been written
to the cloud and will be lost. If the local storage is intact,
no data will be lost; the proxy will replay the changes recorded
in the journal. Periodically, the proxy snapshots the file
system state, collects new file system objects and any in- 
ode map updates into one or more log segments, and up- 
loads those log segments to cloud storage. Our prototype 
proxy implementation does not currently perform dedu- 
plication, and we leave exploring the tradeoffs of such an 
optimization for future work. 

There are tradeoffs in choosing how quickly to flush 
data to the cloud. Writing data to the cloud quickly mini- 
mizes the window for data loss. However, a longer time- 
out has advantages as well: it enables larger log segment 
sizes, and it allows overlapping writes to be combined. In 
the extreme case of short-lived temporary files, no data 
need be uploaded to the cloud. Currently the BlueSky 
proxy commits data as frequently as once every five sec- 
onds. BlueSky does not start writing a new checkpoint 
until the previous one completes, so under a heavy write 
load checkpoints may commit less frequently. 

The proxy keeps a cache on disk to satisfy many read 
requests without going to the cloud; this cache consists 
of old journal segments and log segments downloaded 
from cloud storage. Journal and log segments are dis- 
carded from the cache using an LRU policy, except that 
journal segments not yet committed to the cloud are kept 
pinned in the cache. At most half of the disk cache can be 
pinned in this way. The proxy sends HTTP byte-range 
requests to decrease latency and cost when only part of 
a log segment is needed. It stores partially-downloaded 
segments as sparse files in the cache. 
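
The eviction policy described above (LRU, with uncommitted
journal segments pinned and at most half the cache pinned) can
be sketched as follows. The class and method names are
hypothetical, and capacity is counted in segments rather than
bytes to keep the example short.

```python
from collections import OrderedDict


class SegmentCache:
    """LRU cache of segments; uncommitted journal segments stay pinned."""

    def __init__(self, capacity):
        self.capacity = capacity        # maximum number of cached segments
        self.segments = OrderedDict()   # name -> data, least recently used first
        self.pinned = set()             # journal segments not yet in the cloud

    def can_pin(self):
        # At most half of the cache may be pinned.
        return len(self.pinned) < self.capacity // 2

    def insert(self, name, data, pin=False):
        if pin:
            if not self.can_pin():
                raise RuntimeError("pinned limit reached; writes must wait")
            self.pinned.add(name)
        self.segments[name] = data
        self.segments.move_to_end(name)
        self._evict()

    def read(self, name):
        data = self.segments.get(name)
        if data is not None:
            self.segments.move_to_end(name)   # mark as most recently used
        return data

    def committed(self, name):
        # Once uploaded, a journal segment becomes an ordinary evictable entry.
        self.pinned.discard(name)

    def _evict(self):
        # Walk from least to most recently used, skipping pinned entries.
        for name in list(self.segments):
            if len(self.segments) <= self.capacity:
                break
            if name not in self.pinned:
                del self.segments[name]


cache = SegmentCache(capacity=4)
cache.insert("journal-1", b"j1", pin=True)
cache.insert("log-1", b"a")
cache.insert("log-2", b"b")
cache.insert("log-3", b"c")
cache.read("log-1")
cache.insert("log-4", b"d")   # evicts log-2, the oldest unpinned entry
```

The pinned journal segment survives even though it is the
oldest entry; it becomes evictable only after committed() is
called for it.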


5.2 Connection Management 


The BlueSky storage backends reuse HTTP connections 
when sending and receiving data from the cloud; the 
CURL library handles the details of this connection pool- 
ing. Separate threads perform each upload or download. 
BlueSky limits uploads to no more than 32 segments con- 
currently, to limit contention among TCP sessions and to 
limit memory usage in the proxy (it buffers each segment 
entirely in memory before sending). 
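
A bounded pool of upload threads gives the same kind of limit.
The sketch below is a Python analogy, not the prototype's C and
libcurl code: upload_all and the put callback are hypothetical
names, and only the limit of 32 comes from the text.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_UPLOADS = 32  # limit stated in the text


def upload_all(segments, put, max_workers=MAX_CONCURRENT_UPLOADS):
    """Upload every segment with at most max_workers uploads in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for completion and re-raises any upload exception.
        list(pool.map(lambda item: put(*item), segments.items()))


uploaded = {}
lock = threading.Lock()


def fake_put(key, data):
    # Stand-in for a real provider call; records the size of each upload.
    with lock:
        uploaded[key] = len(data)


pending = {"proxy/%08d" % i: b"x" * 10 for i in range(100)}
upload_all(pending, fake_put)
```

Capping the pool bounds memory as well as TCP contention, since
each in-flight segment is held entirely in memory.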


5.3 


As discussed in Section 4.3, the proxy and the cleaner
operate independently of each other. When the cleaner
runs, it starts from the most recent checkpoint written by
the proxy. The cleaner only ever accesses data relative
to this file system snapshot, even if the proxy writes additional
updates to the cloud. As a result, the proxy and
cleaner each may make updates to the same objects (e.g.,
inodes) in the file system. Since reconciling the updates
requires unencrypted access to the objects, the proxy assumes
responsibility for merging file system state.


merge_inode(ino_p, ino_c):
    if ino_p.id = ino_c.id:
        return ino_c    // No conflicting changes
    // Start with proxy version and merge cleaner changes
    ino_m ← ino_p; ino_m.id ← fresh_uuid(); updated ← false
    for i in [0 ... num_blocks(ino_p) − 1]:
        b_p ← ino_p.blocks[i]; b_c ← ino_c.blocks[i]
        if b_c.id = b_p.id and b_c.loc ≠ b_p.loc:
            // Relocated data by cleaner is current
            ino_m.blocks.append(b_c); updated ← true
        else:    // Take proxy’s version of data block
            ino_m.blocks.append(b_p)
    return (ino_m if updated else ino_p)


Figure 4: Pseudocode for the proxy algorithm that
merges state for possibly divergent inodes. Subscripts
p and c indicate state written by the proxy and cleaner,
respectively; m is used for a candidate merged version.
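
A runnable Python rendering of the merge algorithm in Figure 4
is sketched below. The Block and Inode classes are hypothetical
containers, and one detail is an assumption not stated in the
pseudocode: if the proxy's inode has more blocks than the
cleaner's copy, the extra blocks keep the proxy's version.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    id: str                      # persistent UID of the data block
    loc: Tuple[str, int, int]    # (directory, sequence number, offset)


@dataclass
class Inode:
    id: str
    blocks: List[Block] = field(default_factory=list)


def merge_inode(ino_p: Inode, ino_c: Inode) -> Inode:
    if ino_p.id == ino_c.id:
        return ino_c                     # no conflicting changes
    # Start from the proxy's version and fold in the cleaner's relocations.
    merged = Inode(id=str(uuid.uuid4()))
    updated = False
    for i, b_p in enumerate(ino_p.blocks):
        # Assumption: blocks beyond the cleaner's copy keep the proxy's version.
        b_c = ino_c.blocks[i] if i < len(ino_c.blocks) else None
        if b_c is not None and b_c.id == b_p.id and b_c.loc != b_p.loc:
            merged.blocks.append(b_c)    # relocated by the cleaner, still current
            updated = True
        else:
            merged.blocks.append(b_p)    # take the proxy's version
    return merged if updated else ino_p


# Four blocks in two proxy segments; the cleaner relocates all of them
# while the proxy concurrently rewrites the last block.
old = [Block("b%d" % i, ("proxy", 1 + i // 2, i % 2)) for i in range(4)]
cleaner = Inode("ino-v1", [Block(b.id, ("cleaner", 1, i)) for i, b in enumerate(old)])
proxy = Inode("ino-v2", old[:3] + [Block("b3-new", ("proxy", 3, 0))])
merged = merge_inode(proxy, cleaner)
```

The merged inode takes the first three blocks from the cleaner's
log and the rewritten fourth block from the proxy's.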


When the cleaner finishes execution, it writes an up- 
dated checkpoint record to its log; this checkpoint record 
identifies the snapshot on which the cleaning was based. 
When the proxy sees a new checkpoint record from the 
cleaner, it begins merging updates made by the cleaner 
with its own updates. 


BlueSky does not currently support the general case 
of merging file system state from many writers, and only 
supports the special case of merging updates from a single
proxy and cleaner. This case is straightforward since
only the proxy makes logical changes to the file system 
and the cleaner merely relocates data. In the worst case, 
if the proxy has difficulty merging changes by the cleaner 
it can simply discard the cleaner’s changes. 


The persistent UIDs for objects can optimize the check 
for whether merging is needed. If both the proxy and 
cleaner logs use the same UID for an object, the cleaner’s 
version may be used. The UIDs will differ if the proxy 
has made any changes to the object, in which case the 
objects must be merged or the proxy’s version used. For 
data blocks, the proxy’s version is always used. For in- 
odes, the proxy merges file data block-by-block accord- 
ing to the algorithm shown in Figure 4. The proxy can 
similarly use inode map objects directly if possible, or 
write merged maps if needed. 

Figure 5 shows an example of concurrent updates by
the cleaner and proxy. State (a) includes a file with four
blocks, stored in two segments written by the proxy.
At (b) the cleaner runs and relocates the data blocks.
Concurrently, in (c) the proxy writes an update to the
file, changing the contents of block 4. When the proxy
merges state in (d), it accepts the relocated blocks 1-3
written by the cleaner but keeps the updated block 4. At
this point, when the cleaner runs again it can garbage
collect the two unused proxy segments.

Figure 5: Example of concurrent updates by cleaner and
proxy, and the resulting merged state.


5.4 Implementation 


Our BlueSky prototype is implemented primarily in C, 
with small amounts of C++ and Python. The core Blue- 
Sky library, which implements the file system but not any 
of the front-ends, consists of 8500 lines of code (includ- 
ing comments and whitespace). BlueSky uses GLib for 
data structures and utility functions, libgcrypt for cryp- 
tographic primitives, and libs3 and libcurl for interaction 
with Amazon S3 and Windows Azure.

Our NFS server consists of another 3000 lines of code,
not counting code entirely generated by the rpcgen RPC 
protocol compiler. The CIFS server builds on top of 
Samba 4, adding approximately 1800 lines of code in a 
new backend. These interfaces do not fully implement 
all file system features such as security and permissions 
handling, but are sufficient to evaluate the performance 
of the system. The prototype in-cloud file system cleaner 
is implemented in just 650 lines of portable Python code 
and does not depend on the BlueSky core library. 


6 Evaluation 


In this section we evaluate the BlueSky proxy proto- 
type implementation. We explore performance from the 
proxy to the cloud, the effect of various design choices 
on both performance and cost, and how BlueSky perfor- 
mance varies as a function of its ability to cache client 
working sets for reads and absorb bursts of client writes.


6.1 Experimental Setup


We ran experiments on Dell PowerEdge R200 servers 
with 2.13 GHz Intel Xeon X3210 (quad-core) proces- 
sors, a 7200 RPM 80 GB SATA hard drive, and gigabit 
network connectivity (internal and to the Internet). One 
machine, with 4 GB of RAM, is used as a load generator.
The second machine, with 8 GB of RAM and an addi- 
tional 1.5 TB 7200 RPM disk drive, acts as a standard 
file server or a BlueSky proxy. Both servers run Debian 
testing; the load generator machine is a 32-bit install (re- 
quired for SPECsfs) while the proxy machine uses a 64- 
bit operating system. For comparison purposes we also 
ran a few tests against a commercial NAS filer in pro- 
duction use by our group. We focused our efforts on two 
providers: Amazon’s Simple Storage Service (S3) [1] 
and Windows Azure storage [14]. For Amazon S3, we 
looked at both the standard US region (East Coast) as 
well as S3’s West Coast (Northern California) region. 
We use the SPECsfs2008 [27] benchmark in many of 
our performance evaluations. SPECsfs can generate both 
NFSv3 and CIFS workloads patterned after real-world 
traces. In these experiments, SPECsfs subjects the server 
to increasing loads (measured in operations per second) 
while simultaneously increasing the size of the working 
set of files accessed. Our use of SPECsfs for research 
purposes does not follow all rules for fully-compliant 
benchmark results, but should allow for relative compar- 
isons. System load on the load generator machine re- 
mains low, and the load generator is not the bottleneck. 
In several of the benchmarks, the load generator ma- 
chine mounts the BlueSky file system with the standard 
Linux NFS client. In Section 6.4, we use a synthetic load 
generator which directly generates NFS read requests 
(bypassing the kernel NFS client) for better control. 


6.2 Cloud Provider Bandwidth 


To understand the performance bounds on any imple- 
mentation and to guide our specific design, we measured 
the performance our proxy is able to achieve writing data 
to Amazon S3. Figure 6 shows that the BlueSky proxy 
has the potential to fully utilize its gigabit link to S3 
if it uses large request sizes and parallel TCP connec- 
tions. The graph shows the total rate at which the proxy 
could upload data to S3 for a variety of request sizes and 
number of parallel connections. Network round-trip time 
from the proxy to the standard S3 region, shown in the
graph, is around 30 ms. We do not pipeline requests—we 
wait for confirmation for each object on a connection be- 
fore sending another one—so each connection is mostly 
idle when uploading small objects. Larger objects better 
utilize the network, but objects of one to a few megabytes 
are sufficient to capture most gains. A single connection
utilizes only a fraction of the total bandwidth, so to
fully make use of the network we need multiple parallel
TCP connections. These measurements helped inform
the choice of 4 MB log segments (Section 4.2) and a pool
size of 32 connections (Section 5.2).

Figure 6: Measured aggregate upload performance to
Amazon S3, as a function of the size of the objects uploaded
(x-axis) and number of parallel connections made
(various curves). A gigabit network link is available. Full
use of the link requires parallel uploads of large objects.
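
A first-order model helps explain why both large objects and
parallel connections are needed. The formula below is an
assumption of this sketch, not a result from the measurements:
it treats each unpipelined upload as one round trip plus the
transfer time at full link rate, and ignores TCP slow start,
request overhead, and provider-side limits.

```python
def upload_rate(object_bytes, connections, rtt_s=0.030, link_bps=1e9):
    """Estimate the aggregate upload rate in bits per second."""
    bits = object_bytes * 8
    transfer_s = bits / link_bps                 # time on the wire at full rate
    per_connection = bits / (rtt_s + transfer_s) # one object per round trip
    return min(link_bps, connections * per_connection)


small = upload_rate(4 * 1024, connections=1)            # 4 KB objects, serial
small_parallel = upload_rate(4 * 1024, connections=32)  # 4 KB objects, 32 at once
large = upload_rate(4 * 1024 * 1024, connections=32)    # 4 MB segments, 32 at once
```

Under these assumptions a single connection sending 4 KB objects
over a 30 ms round trip reaches only about 1 Mbit/s, and even 32
such connections stay far below the link rate, whereas 4 MB
objects on 32 connections are limited by the link itself.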

The S3 US-West data center is closer to our proxy lo- 
cation and has a correspondingly lower measured round- 
trip time of 12 ms. The round-trip time to Azure from our 
location was substantially higher, around 85 ms. Yet net- 
work bandwidth was not a bottleneck in either case, with 
the achievable bandwidth again approaching 1 Gbps. In
most benchmarks, we use the Amazon US-West region 
as the default cloud storage service. 


6.3 Impact of Cloud Latency 


To underscore the impact latency can have on file sys- 
tem performance, we first run a simple, time-honored 
benchmark of unpacking and compiling a kernel source 
tree. We measure the time for three steps: (1) extract 
the sources for Linux 2.6.37, which consist of roughly 
400 MB in 35,000 files (a write-only workload); (2) 
checksum the contents of all files in the extracted sources 
(a read-only workload); (3) build an i386 kernel using
the default configuration and the -j4 flag for up to four
parallel compiles (a mixed read/write workload). For a 
range of comparisons, we repeat this experiment on a 
number of system configurations. In all cases with a 
remote file server, we flushed the client’s cache by un- 
mounting the file system in between steps. 

Table 1 shows the timing results of the benchmark 
steps for the various system configurations. Recall that 
the network links client↔proxy and proxy↔S3 are both
1 Gbps—the only difference is latency (12 ms from the 
proxy to BlueSky/S3-West and 30 ms to BlueSky/S3- 
East). Using a network file system, even locally, adds 
considerably to the execution time of the benchmark 


USENIX Association 


                          Unpack    Check    Compile

Local file system
  warm client cache         0:30     0:02       3:05
  cold client cache            -     0:27          -
Local NFS server
  warm server cache        10:50     0:26       4:23
  cold server cache            -     0:49          -
Commercial NAS filer
  warm cache                2:18     3:16       4:32
NFS server in EC2
  warm server cache        65:39    26:26      74:11
BlueSky/S3-West
  warm proxy cache          5:10     0:33       5:50
  cold proxy cache             -    26:12       7:10
  full segment                 -     1:49       6:45
BlueSky/S3-East
  warm proxy                5:08     0:35  (illegible)
  cold proxy cache             -    57:26       8:35
  full segment                 -     3:50       8:07


Table 1: Kernel compilation benchmark times for various 
file server configurations. Steps are (1) unpack sources, 
(2) checksum sources, (3) build kernel. Times are given 
in minutes:seconds. Cache flushing and prefetching are 
only relevant in steps (2) and (3). 


compared to a local disk. However, running an NFS 
server in EC2 compared to running it locally increases 
execution times by a factor of 6-30 due to the high la- 
tency between the client and server and a workload with 
operations on many small files. In our experiments we 
use a local Linux NFS server as a baseline. Our commer- 
cial NAS filer does give better write performance than a 
Linux NFS server, likely due in part to better hardware 
and an NVRAM write cache. Enterprises replacing such 
filers with BlueSky on generic rack servers would there- 
fore experience a drop in write performance. 

The substantial impact latency can have on workload 
performance motivates the need for a proxy architec- 
ture. Since clients interact with the BlueSky proxy with 
low latency, BlueSky with a warm disk cache is able 
to achieve performance similar to a local NFS server. 
(In this case, BlueSky performs slightly better than NFS 
because its log-structured design is better-optimized for 
some write-heavy workloads; however, we consider this 
difference incidental.) With a cold cache, it has to read 
small files from S3, incurring the latency penalty of read- 
ing from the cloud. Ancillary prefetching from fetching 
full 4 MB log segments when a client requests data in 
any part of the segment greatly improves performance, 
in part because this particular benchmark has substantial 
locality; later on we will see that, in workloads with little 
locality, full segment fetches hurt performance. How- 
ever, execution times are still multiples of BlueSky with 
a warm cache. The differences in latencies between S3-West
and S3-East for the cold cache and full segment
cases again underscore the sensitivity to cloud latency.

Figure 7: Read latency as a function of working set captured
by the proxy. Results are from a single run.

In summary, greatly masking the high latency to cloud 
storage—even with high-bandwidth connectivity to the 
storage service—requires a local proxy to minimize la- 
tency to clients, while fully masking high cloud latency 
further requires an effective proxy cache. 


6.4 Caching the Working Set 


The BlueSky proxy can mask the high latency overhead 
of accessing data on a cloud service by caching data close 
to clients. For what kinds of file systems can such a 
proxy be an effective cache? Ideally, the proxy needs to 
cache the working set across all clients using the file sys- 
tem to maximize the number of requests that the proxy 
can satisfy locally. Although a number of factors can 
make generalizing difficult, previous studies have esti- 
mated that clients of a shared network file system typi- 
cally have a combined working set that is roughly 10% 
of the entire file system in a day, and less at smaller time 
scales [24, 31]. For BlueSky to provide acceptable per- 
formance, it must have the capacity to hold this working 
set. As a rough back-of-the-envelope calculation using this
conservative daily estimate, a proxy with one commodity 3 TB
disk of local storage could capture the daily working set 
for a 30 TB file system, and five such disks raises the file 
system size to 150 TB. Many enterprise storage needs 
fall well within this envelope, so a BlueSky proxy can 
comfortably capture working sets for such scenarios. 
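
The arithmetic behind this estimate is a single division, shown
here as a worked example. The function name is hypothetical; the
10% daily working set and the 3 TB disk size are the figures
used in the text.

```python
def max_file_system_tb(cache_tb, working_set_percent=10):
    """Largest file system (TB) whose daily working set fits in the cache."""
    return cache_tb * 100 / working_set_percent


one_disk = max_file_system_tb(3)          # one 3 TB disk   -> 30 TB
five_disks = max_file_system_tb(5 * 3)    # five such disks -> 150 TB
```

A smaller working-set fraction, as seen at shorter time scales,
would raise these limits proportionally.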

In practice, of course, workloads are dynamic. Even 
if proxy cache capacity is not an issue, clients shift 
their workloads over time and some fraction of the client 
workload to the proxy cannot be satisfied by the cache. 
To evaluate these cases, we use synthetic read and write 
workloads, and do so separately because they interact 
with the cache in different ways. 

We start with read workloads. Reads that hit in the 
cache achieve local performance, while reads that miss
in the cache incur the full latency of accessing data in the
cloud, stalling the clients accessing the data. The ratio of 
read hits and misses in the workload determines overall 
read performance, and fundamentally depends on how 
well the cache capacity is able to capture the file system 
working set across all clients in steady state. 

We populate a BlueSky file system on S3 with 32 GB 
of data using 16 MB files.' We then generate a steady 
stream of fixed-size NFS read requests to random files 
through the BlueSky proxy. We vary the size of the proxy 
disk cache to represent different working set scenarios. 
In the best case, the capacity of the proxy cache is large 
enough to hold the entire working set: all read requests 
hit in the cache in steady state, minimizing latency. In 
the worst case, the cache capacity is zero, no part of the 
working set fits in the cache, and all requests go to the 
cloud service. In practice, a real workload falls in be- 
tween these extremes. Since we make uniform random 
requests to any of the files, the working set is equivalent 
to the size of the entire file system. 

Figure 7 shows that BlueSky with S3 provides good 
latency even when it is able to cache only 50% of the 
working set: with a local NFS latency of 21 ms for 32 KB 
requests, BlueSky is able to keep latency within 2× that
value. Given that cache capacity is not an issue, this sit- 
uation corresponds to clients dramatically changing the 
data they are accessing such that 50% of their requests 
are to new data objects not cached at the proxy. Larger 
requests take better advantage of bandwidth: 1024 KB 
requests are 32× larger than the 32 KB requests, but have
latencies only 4× longer.
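
The dependence of read latency on hit ratio can be sketched with a simple two-point model. The model is an assumption for illustration: it treats every hit as costing the local NFS latency and every miss as costing a single fixed cloud latency, which the measurements above do not claim. Under that model, the reported numbers (21 ms local latency, at most 2× that at a 50% hit ratio) imply a bound on the average miss latency.

```python
def expected_read_latency_ms(hit_ratio: float, hit_ms: float,
                             miss_ms: float) -> float:
    """Average latency when a fraction hit_ratio of reads hit the cache."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


def implied_miss_latency_ms(hit_ratio: float, hit_ms: float,
                            average_ms: float) -> float:
    """Miss latency that yields average_ms under the same two-point model."""
    if hit_ratio >= 1.0:
        raise ValueError("no misses to attribute latency to")
    return (average_ms - hit_ratio * hit_ms) / (1.0 - hit_ratio)


# 21 ms local latency; an average of 2 x 21 ms at a 50% hit ratio.
print(implied_miss_latency_ms(0.5, 21.0, 42.0))
print(expected_read_latency_ms(0.5, 21.0, 63.0))
```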


6.5 Absorbing Writes 


The BlueSky proxy represents a classic write-back cache 
scenario in the context of a cache for a wide-area stor- 
age backend. In contrast to reads, the BlueSky proxy can 
absorb bursts of write traffic entirely with local perfor- 
mance since it implements a write-back cache. Two fac- 
tors determine the proxy’s ability to absorb write bursts: 
the capacity of the cache, which determines the instan- 
taneous size of a burst the proxy can absorb; and the 
network bandwidth between the proxy and the cloud ser- 
vice, which determines the rate at which the proxy can 
drain the cache by writing back data. As long as the write 
workload from clients falls within these constraints, the 
BlueSky proxy can entirely mask the high latency to the 
cloud service for writes. However, if clients instanta- 
neously burst more data than can fit in the cache, or if 
the steady-state write workload is higher than the band- 
width to the cloud, client writes start to experience delays 
that depend on the performance of the cloud service. 


¹For this and other experiments, we use relatively small file system
sizes to keep the time for performing experiments manageable. 
Figure 8: Write latencies when the proxy is uploading
over a constrained (~100 Mbps) uplink to S3 as a func-
tion of the write rate of the client and the size of the write 
cache to temporarily absorb writes. 


We populate a BlueSky file system on S3 with 1 MB 
files and generate a steady stream of fixed-size 1 MB 
NFS write requests to random files in the file system. The 
client bursts writes at different rates for two minutes and 
then stops. So that we can overload the network between 
the BlueSky proxy and S3, we rate limit traffic to S3 at 
100 Mbps while keeping the client↔proxy link unlim-
ited at 1 Gbps. We start with a rate of write requests well 
below the traffic limit to S3, and then steadily increase 
the rate until the offered load is well above the limit. 

Figure 8 shows the average latency of the 1 MB write 
requests as a function of offered load, with error bars 
showing standard deviation across three runs. At low 
write rates the latency is determined by the time to com- 
mit writes to the proxy’s disk. The proxy can upload at 
up to about 12 MB/s to the cloud (due to the rate limit- 
ing), so beyond this point latency increases as the proxy 
must throttle writes by the client when the write buffer 
fills. With a 1 GB write-back cache the proxy can tem- 
porarily sustain write rates beyond the upload capacity. 
Over a 10 Mbps network (not shown), the write cache 
fills at correspondingly smaller client rates and latencies 
similarly quickly increase. 
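
The interaction between cache capacity and upload bandwidth can be approximated with a fluid model. This is a rough sketch under stated assumptions: the cache fills at the difference between the client write rate and the drain rate, 100 Mbps is taken as 12.5 MB/s with no protocol overhead (the measured figure above is about 12 MB/s), and the function names are illustrative.

```python
def seconds_until_throttled(cache_mb: float, client_mb_s: float,
                            drain_mb_s: float) -> float:
    """Time before an initially empty write-back cache fills.

    Returns infinity when the uplink drains at least as fast as clients write.
    """
    surplus = client_mb_s - drain_mb_s
    if surplus <= 0:
        return float("inf")
    return cache_mb / surplus


def max_burst_rate_mb_s(cache_mb: float, drain_mb_s: float,
                        burst_seconds: float) -> float:
    """Highest client rate a burst of the given length can sustain unthrottled."""
    return drain_mb_s + cache_mb / burst_seconds


DRAIN_MB_S = 100 / 8  # 100 Mbps expressed in MB/s, ignoring overhead
print(seconds_until_throttled(1024, 20.5, DRAIN_MB_S))
print(max_burst_rate_mb_s(1024, DRAIN_MB_S, 120))  # two-minute burst
```

With these assumptions, a 1 GB cache lets a two-minute burst exceed the uplink rate by roughly 8.5 MB/s before the proxy must throttle clients.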


6.6 More Elaborate Workloads 


Using the SPECsfs2008 benchmark we next examine the 
performance of BlueSky under more elaborate workload 
scenarios, both to subject BlueSky to more interesting 
workload mixes as well as to highlight the impact of 
different design decisions in BlueSky. We evaluate a 
number of different system configurations, including a 
native Linux nfsd in the local network (Local NFS) as 
well as BlueSky communicating with both Amazon S3’s 
US-West region and Windows Azure’s blob store. Un- 
less otherwise noted, BlueSky evaluation results are for 
communication with Amazon S3. In addition to the base 
Figure 9: Comparison of various file server configurations subjected to the SPECsfs benchmark, with a low degree of
parallelism (4 client processes). All BlueSky runs use cryptography, and most use Amazon US-West. 


BlueSky configuration, we test a number of variants: dis- 
abling the log-structured design to store each object in- 
dividually to the cloud (noseg), disabling range requests 
on reads so that full segments must be downloaded (no- 
range), and using 4 KB file system blocks instead of the 
default 32 KB (4K). The “noseg” case is meant to allow 
a rough comparison with BlueSky had it been designed 
to store file system objects directly to the cloud (without 
entirely reimplementing it). 

We run the SPECsfs benchmark in two different sce- 
narios, modeling both low and high degrees of client par- 
allelism. In the low-parallelism case, 4 client processes 
make requests to the server, each with at most 2 outstand- 
ing reads or writes. In the high-parallelism case, there are 
16 client processes each making up to 8 reads or writes. 


Figure 9 shows several SPECsfs runs under the low- 
parallelism case. In these experiments, the BlueSky 
proxy uses an 8 GB disk cache. The left graph shows the 
delivered throughput against the load offered by the load 
generator, and the right graph shows the corresponding 
average latency for the operations. At a low requested 
load, the file servers can easily keep up with the requests 
and so the achieved operations per second are equal to 
the requested load. As the server becomes saturated the 
achieved performance levels off and then decreases. 


The solid curve corresponds to a local NFS server 
using one of the disks of the proxy machine for stor- 
age. This machine can sustain a rate of up to 420 op- 
erations/sec, at which point the disk is the performance 
bottleneck. The BlueSky server achieves a low latency— 
comparable to the local server case—at low loads since 
many operations hit in the proxy’s cache and avoid wide- 
area network communication. At higher loads, perfor- 
mance degrades as the working set size increases. In 
write-heavy workloads, BlueSky incidentally performs 
better than the native Linux NFS server with local disk, 
since BlueSky commits operations to disk in a single
journal and can make better use of disk bandwidth. Fun-
damentally, though, we consider using cloud storage suc- 
cessful as long as it provides performance commensurate 
with standard local network file systems. 


BlueSky’s aggregation of written data into log seg- 
ments, and partial retrieval of data with byte-range re- 
quests, are important to achieving good performance and 
low cost with cloud storage providers. As discussed in 
Section 6.2, transferring data as larger objects is impor- 
tant for fully utilizing available bandwidth. As we show 
below, from a cost perspective larger objects are also bet- 
ter since small objects require more costly operations to 
store and retrieve an equal quantity of data. 


In this experiment we also used Windows Azure as 
the cloud provider. Although Azure did not perform as 
well as S3, we attribute the difference primarily to the 
higher latency (85 ms RTT) to Azure from our proxy 
location (recall that we achieved equivalent maximum 
bandwidths to both services). 


Figure 10 shows similar experiments but with a high 
degree of client parallelism. In these experiments, the 
proxy is configured with a 32 GB cache. To simulate 
the case in which cryptographic operations are better- 
accelerated, cryptography is disabled in most experi- 
ments but re-enabled in the “+crypto” experimental run. 
The “100 Mbps” test is identical to the base BlueSky 
experiment except that bandwidth to the cloud is con- 
strained to 100 Mbps instead of 1 Gbps. Performance is 
comparable at first, but degrades somewhat and is more 
erratic under more intense workloads. Results in these 
experimental runs are similar to the low-parallelism case. 
The servers achieve a higher total throughput when there 
are more concurrent requests from clients. In the high- 
parallelism case, both BlueSky and the local NFS server 
provide comparable performance. Comparing cryptogra- 
phy enabled versus disabled, again there is very little dif- 
ference: cryptographic operations are not a bottleneck. 
Figure 10: Comparison of various file server configurations subjected to the SPECsfs benchmark, with a high degree
of parallelism (16 client processes). Most tests have cryptography disabled, but the “+crypto” test re-enables it. 


Configuration     Down      Op     Total    (Up)
Baseline         $0.18   $0.09    $0.27   $0.56
4 KB blocks       0.09    0.07     0.16    0.47
Full segments    25.11    0.09    25.20    1.00
No segments       0.17    2.91     3.08    0.56


Table 2: Cost breakdown and comparison of various 
BlueSky configurations for using cloud storage. Costs 
are normalized to the cost per one million NFS opera- 
tions in SPECsfs. Breakdowns include traffic costs for 
uploading data to S3 (Up), downloading data (Down), 
Operation costs (Op), and their sum (Total). Amazon 
eliminated “Up” costs in mid-2011, but values using the 
old price are still shown for comparison. 


6.7 Monetary Cost 


Offloading file service to the cloud introduces monetary 
cost as another dimension for optimization. Figure 9 
showed the relative performance of different variants of 
BlueSky using data from the low-parallelism SPECsfs 
benchmark runs. Table 2 shows the cost breakdown 
of each of the variants, normalized per SPECsfs opera- 
tion (since the benchmark self-scales, different experi- 
ments have different numbers of operations). We use the 
September 2011 prices (in US Dollars) from Amazon S3 
as the basis for the cost analysis: $0.14/GB stored per 
month, $0.12/GB transferred out, and $0.01 per 10,000 
get or 1,000 put operations. S3 also offers cheaper price 
tiers for higher use, but we use the base prices as a worst 
case. Overall prices are similar for other providers. 
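
The prices quoted above translate into a simple cost function. The sketch below is illustrative only: the constants are the base prices from the text, while the function names and the 1 GiB example are assumptions and do not reproduce the Table 2 measurements. It shows why fetching the same data as many small objects costs more in per-request charges.

```python
# Base S3 prices quoted in the text (September 2011, US dollars).
PRICE_PER_GB_OUT = 0.12        # per GB transferred out
PRICE_PER_GET = 0.01 / 10_000  # per GET request
PRICE_PER_PUT = 0.01 / 1_000   # per PUT request


def transfer_cost_usd(gb_out: float, gets: int = 0, puts: int = 0) -> float:
    """Bandwidth plus request charges for a batch of operations."""
    return gb_out * PRICE_PER_GB_OUT + gets * PRICE_PER_GET + puts * PRICE_PER_PUT


def gets_needed(total_bytes: int, object_bytes: int) -> int:
    """Number of whole-object GETs needed to fetch total_bytes."""
    return -(-total_bytes // object_bytes)  # ceiling division


GIB = 1 << 30
for label, size in [("4 MB objects", 4 << 20), ("32 KB objects", 32 << 10)]:
    n = gets_needed(GIB, size)
    print(label, n, "GETs, request cost in USD:", round(n * PRICE_PER_GET, 6))
```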
Unlike performance, Table 2 shows that comparing by 
cost changes the relative ordering of the different system 
variants. Using 4 KB blocks had very poor performance, 
but using them has the lowest cost since they effectively 
transfer only data that clients request. The BlueSky base- 
line uses 32 KB blocks, requiring more data transfers 
and higher costs overall. If a client makes a 4 KB re- 
quest, the proxy will download the full 32 KB block;
many times downloading the full block will satisfy fu- 
ture client requests with spatial locality, but not always. 
Finally, the range request optimization is essential in re- 
ducing cost. When the proxy downloads an entire 4 MB 
segment when a client requests any data in it, the cost for 
downloading data increases by 150×. If providers did
not support range requests, BlueSky would have to use 
smaller segments in its file system layout. 

Although 4 KB blocks have the lowest cost, we argue 
that using 32 KB blocks has the best cost-performance 
tradeoff. The costs with 32 KB blocks are higher, but the
performance of 4 KB blocks is far too low for a system
that relies upon wide-area transfers.


6.8 Cleaning 


As with other file systems that do not overwrite in place, 
BlueSky must clean the file system to garbage collect 
overwritten data—although less to recover critical stor- 
age space, and more to save on the cost of storing unnec- 
essary data at the cloud service. Recall that we designed 
the BlueSky cleaner to operate in one of two locations: 
running on the BlueSky proxy or on a compute instance 
in the cloud service. Cleaning in the cloud has com- 
pelling advantages: it is faster, does not consume proxy
network bandwidth, and is cheaper since cloud services 
like S3 and Azure do not charge for local network traffic. 

The overhead of cleaning fundamentally depends on 
the workload. The amount of data that needs to be read 
and written back depends on the rate at which existing 
data is overwritten and the fraction of live data in cleaned 
segments, and the time it takes to clean depends on both. 
Rather than hypothesize a range of workloads, we de- 
scribe the results of a simple experiment to detail how 
the cleaner operates. 

We populate a small BlueSky file system with 64 MB 
of data, split across 8 files. A client randomly writes, ev- 
ery few seconds, to a small portion (0.5 MB) of one of 
Figure 11: Storage space consumed during a write ex-
periment running concurrently with the cleaner. 


these files. Over the course of the experiment the client 
overwrites 64 MB of data. In parallel a cleaner runs to 
recover storage space and defragment file contents; the 
cleaner runs every 30 seconds, after the proxy incorpo- 
rates changes made by the previous cleaner run. In ad- 
dition to providing data about cleaner performance, this 
experiment validates the design that allows for safe con- 
current execution of both the proxy and cleaner. 

Figure 11 shows the storage consumed during this 
cleaner experiment; each set of stacked bars shows stor- 
age after a pass by the cleaner. At any point in time, 
only 64 MB of data is live in the file system, some of 
which (bottom dark bar) consists of data left alone by 
the cleaner and some of which (lighter gray bar) was 
rewritten by the cleaner. Some wasted space (lightest 
gray) cannot be immediately reclaimed; this space is ei- 
ther mixed useful data/garbage segments, or data whose 
relocation the proxy has yet to acknowledge. However, 
the cleaner deletes segments which it can establish the 
proxy no longer needs (white) to reclaim storage. 

This workload causes the cleaner to write large 
amounts of data, because a small write to a file can cause 
the entire file to be rewritten to defragment the contents. 
Over the course of the experiment, even though the client 
only writes 64 MB of data, the cleaner writes out an ad-
ditional 224 MB of data. However, all these additional 
writes happen within the cloud where data transfers are 
free. The extra activity at the proxy, to merge updates 
written by the cleaner, adds only 750 KB in writes and 
270 KB in reads. 

Despite all the data being written out, the cleaner is 
able to reclaim space during experiment execution to 
keep the total space consumption bounded, and when the 
client write activity finishes at the end of the experiment 
the cleaner can repack the segment data to eliminate all 
remaining wasted space. 


6.9 Client Protocols: NFS and CIFS 


Finally, we use the SPECsfs benchmark to confirm that 
the performance of the BlueSky proxy is independent of 
Figure 12: Latencies for read operations in SPECsfs as a
function of aggregate operations per second (for all op- 
erations) and working set size. 


the client protocol (NFS or CIFS) that clients use. The
experiments performed above use NFS for convenience, 
but the results hold for clients using CIFS as well. 

Figure 12 shows the latency of the read operations in 
the benchmark as a function of aggregate operations per 
second (for all operations) and working set size. Because 
SPECsfs uses different operation mixes for its NFS and 
CIFS workloads, we focus on the latency of just the read 
operations for a common point of comparison. We show 
results for NFS and CIFS on the BlueSky proxy (Sec- 
tion 5.4) as well as standard implementations of both pro- 
tocols (Linux NFS and Samba for CIFS, on which our 
implementation is based). For the BlueSky proxy and 
standard implementations, the performance of NFS and 
CIFS are broadly similar as the benchmark scales, and 
BlueSky mirrors any differences in the underlying stan- 
dard implementations. Since SPECsfs uses a working 
set much larger than the BlueSky proxy cache capacity 
in this experiment, BlueSky has noticeably higher laten- 
cies than the standard implementations due to having to 
read data from cloud storage rather than local disk. 


7 Conclusion 


The promise of “the cloud” is that computation and stor- 
age will one day be seamlessly outsourced on an on- 
demand basis to massive data centers distributed around 
the globe, while individual clients will effectively be- 
come transient access portals. This model of the fu- 
ture (ironically similar to the old “big iron” mainframe 
model) may come to pass at some point, but today there 
are many hundreds of billions of dollars invested in the 
last disruptive computing model: client/server. Thus, in 
the interstitial years between now and a potential future 
built around cloud infrastructure, there will be a need to 
bridge the gap from one regime to the other. 

In this paper, we have explored a solution to one such 
challenge: network file systems. Using a caching proxy 
architecture we demonstrate that LAN-oriented worksta-
tion file system clients can be transparently served by 
cloud-based storage services with good performance for 
enterprise workloads. However, we show that exploit- 
ing the benefits of this arrangement requires that design 
choices (even low-level choices such as storage layout) 
are directly and carefully informed by the pricing mod- 
els exported by cloud providers (this coupling ultimately 
favoring a log-structured layout with in-cloud cleaning). 
Abstract


To reduce storage overhead, cloud file systems are 
transitioning from replication to erasure codes. This pro- 
cess has revealed new dimensions on which to evalu- 
ate the performance of different coding schemes: the 
amount of data used in recovery and when performing 
degraded reads. We present an algorithm that finds the 
optimal number of codeword symbols needed for recov- 
ery for any XOR-based erasure code and produces re- 
covery schedules that use a minimum amount of data. 
We differentiate popular erasure codes based on this cri- 
terion and demonstrate that the differences improve I/O 
performance in practice for the large block sizes used in 
cloud file systems. Several cloud systems [15, 10] have 
adopted Reed-Solomon (RS) codes, because of their gen- 
erality and their ability to tolerate larger numbers of fail- 
ures. We define a new class of rotated Reed-Solomon 
codes that perform degraded reads more efficiently than 
all known codes, but otherwise inherit the reliability and 
performance properties of Reed-Solomon codes. 


1 Introduction 


Cloud file systems transform the requirements for era- 
sure codes because they have properties and workloads 
that differ from traditional file systems and storage ar- 
rays. Our model for a cloud file system using era- 
sure codes is inspired by Microsoft Azure [10]. It con- 
forms well with HDFS [8] modified for RAID-6 [14] 
and Google’s analysis of redundancy coding [15]. Some 
cloud file systems, such as Microsoft Azure and the 
Google File system, create an append-only write work- 
load using a large block size. Writes are accumulated and 
buffered until a block is full and then the block is sealed: 
it is erasure coded and the coded blocks are distributed to 
storage nodes. Subsequent reads to sealed blocks often 
access smaller amounts of data than the block size, depend-
ing upon workload [14, 46]. 


When examining erasure codes in the context of cloud 
file systems, two performance critical operations emerge. 
These are degraded reads to temporarily unavailable 
data and recovery from single failures. Although era- 
sure codes tolerate multiple simultaneous failures, single 
failures represent 99.75% of recoveries [44]. Recovery 
performance has always been important. Previous work 
includes architecture support [13, 21] and workload op- 
timizations for recovery [22, 48, 45]. However, it is par- 
ticularly acute in the cloud owing to scale. Massive sys- 
tems have frequent component failures so that recovery 
becomes part of regular operation [16]. 


Frequent and temporary data unavailability in the 
cloud results in degraded reads. In the period between 
failure and recovery, reads are degraded because they 
must reconstruct data from unavailable storage nodes us- 
ing erasure codes. This is by necessity a slower opera- 
tion than reading the data without reconstruction. Tem- 
porary unavailability dominates disk failures. Transient 
errors in which no data are lost account for more than 
90% of data center failures [15], owing to network par- 
titions, software problems, or non-disk hardware faults. 
For this reason, Google delays the recovery of failed stor- 
age nodes for 15 minutes. Temporary unavailability also 
arises systematically when software upgrades take stor- 
age nodes offline. In many data centers, software updates 
are a rolling, continuous process [9]. 


Only recently have techniques emerged to reduce the 
data requirements of recovering an erasure code. Two re- 
cent research projects have demonstrated how the RAID- 
6 codes RDP and EVENODD may recover from single 
disk failures by reading significantly smaller subsets of 
codeword symbols than the previous standard practice of 
recovering from the parity drive [51, 49]. Our contribu- 
tions to recovery performance generalize these results to 
all XOR-based erasure codes, analyze existing codes to 
differentiate them based on recovery performance, and 
experimentally verify that reducing the amount of data 
used in recovery translates directly into improved perfor- 
mance for cloud file systems, but not for typical RAID
array configurations. 

We first present an algorithm that finds the optimal 
number of symbols needed for recovering data from an 
arbitrary number of disk failures, which also minimizes 
the amount of data read during recovery. We include an 
analysis of single failures in RAID-6 codes that reveals 
that sparse codes, such as Blaum-Roth [5], Liberation 
[34] and Liber8tion [35], have the best recovery proper- 
ties, reducing data by about 30% over the standard tech- 
nique that recovers each row independently. We also an- 
alyze codes that tolerate three or more disk failures, in- 
cluding the Reed-Solomon codes used by Google [15] 
and Microsoft Azure [10]. 

Our implementation and evaluation of this algorithm 
demonstrates that minimizing recovery data translates di- 
rectly into improved I/O performance for cloud file sys- 
tems. For large stripe sizes, experimental results track the 
analysis and increase recovery throughput by 30%. How- 
ever, the algorithm requires the large stripes created by 
large sealed blocks in cloud file systems in order to amor- 
tize the seek costs incurred when reading non-contiguous 
symbols. This is in contrast to recovery of the smaller 
stripes used by RAID arrays and in traditional file sys- 
tems in which the streaming recovery of all data outper- 
forms our algorithm for stripe sizes below 1 MB. Prior
work on minimizing recovery I/O [51, 49, 27] is purely 
analytic, whereas our work incorporates measurements 
of recovery performance. 

We also examine the amount of data needed to perform 
degraded reads and reveal that it can use fewer symbols 
than recovery. An analysis of RAID-6 and three disk 
failure codes shows that degraded read performance dif- 
ferentiates codes that otherwise have the same recovery 
properties. Reads that request less than a stripe of data 
make the savings more acute, as much as 50%. 

Reed-Solomon codes are particularly poor for de- 
graded reads in that they must always read all data disks 
and parity for every degraded read. This is problem- 
atic because RS codes are popular owing to their gen- 
erality and applicability to nearly all coding situations. 
We develop a new class of codes, Rotated Reed-Solomon 
codes, that exceed the degraded read performance of 
all other codes, but otherwise have the encoding perfor- 
mance and reliability properties of RS Codes. Rotated 
RS codes can be constructed for arbitrary numbers of 
disks and failures. 


2 Related Work 


Performance Metrics: Erasure codes have been eval- 
uated historically on a variety of metrics, such as the 
CPU impact of encoding and decoding [3, 11, 37], the 
penalty of updating small amounts of data [5, 26, 52] and 
the ability to reconfigure systems without re-encoding [3, 
7, 26]. The CPU performance of different erasure codes
can vary significantly. However, for all codes that we 
consider, encoding and decoding bandwidth is orders of 
magnitude faster than disk bandwidth. Thus, the dom- 
inant factor when sealing data is writing the erasure- 
coded blocks to disk, not calculating the codes. Simi- 
larly, when decoding either for recovery or for degraded 
reads, the dominant factor is reading the data. 

Updating small amounts of data is also not a con- 
cern in cloud file systems—the append-only write pattern 
and sealed blocks eliminate small writes in their entirety. 
System reconfiguration refers to changing coding param- 
eters: changing the stripe width or increasing/decreasing 
fault tolerance. This type of reconfigurability is less important
in clouds because each sealed block defines an
independent stripe group, spread across cloud storage 
nodes differently than other sealed blocks. There is no 
single array of disks to be reconfigured. If the need for 
reconfiguration arises, each sealed block is re-encoded 
independently. 

There has been some work lowering I/O costs in 
erasure-coded systems. In particular, WEAVER [19], 
Pyramid [23] and Stepped Combination Codes [18] have 
all been designed to lower I/O costs on recovery. How- 
ever, all of these codes are non-MDS, which means that 
they do not have the storage efficiency that cloud stor- 
age systems demand. The REO RAID Engine [26] min- 
imizes I/O in erasure-coded storage systems; however, 
its focus is primarily on the effect of updates on storage 
systems of smaller scale. 


Cloud Storage Systems: The default storage policy in 
cloud file systems has become triplication (triple repli- 
cation), implemented in the Google File system [16] and 
adopted by Hadoop [8] and many others. Triplication has 
been favored because of its ease of implementation, good 
read and recovery performance, and reliability. 

The storage overhead of triplication is a concern, lead-
ing system designers to consider erasure coding as an al- 
ternative. The performance tradeoffs between replication 
and erasure coding are well understood and have been 
evaluated in many environments, such as peer-to-peer file 
systems [43, 50] and open-source coding libraries [37]. 

Investigations into applying RAID-6 (two fault toler- 
ant) erasure codes in cloud file systems show that they 
reduce storage overheads from 200% to 25% at a small 
cost in reliability and the performance of large reads 
[14]. Microsoft Research further explored the cost/benefit
tradeoffs and expanded the analysis to new metrics: power
proportionality and complexity [53]. For these reasons, 
Facebook is evaluating RAID-6 and erasure codes in 
their cloud infrastructure [47]. Our work supports this 
trend, providing specific guidance as to the relative mer- 
its of different RAID-6 codes with a focus on recover- 
ability and degraded reads. 
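
The storage overhead figures quoted above follow from simple ratios. The sketch below assumes overhead is measured as redundant data relative to user data; the specific RAID-6 geometry (k = 8 data disks, m = 2 coding disks) is an assumption chosen because it yields 25%, and is not stated in the text.

```python
def replication_overhead_pct(copies: int) -> float:
    """Extra storage, as a percentage of user data, for full replication."""
    return (copies - 1) * 100.0


def erasure_overhead_pct(k: int, m: int) -> float:
    """Extra storage for a code with k data and m coding disks."""
    return 100.0 * m / k


print(replication_overhead_pct(3))  # triplication
print(erasure_overhead_pct(8, 2))   # assumed RAID-6 geometry
```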
Ford et al. [15] have developed reliability models for
Google’s cloud file system and validated models against 
a year of workload and failure data from the Google in- 
frastructure. Their analysis concludes that data place- 
ment strategies need to be aware of failure groupings and 
failure bursts. They also argue that, in the presence of 
correlated failures, codes more fault tolerant than RAID- 
6 are needed to reduce exposure to data loss; they con-
sider Reed-Solomon codes that tolerate three and four 
disk failures. Windows Azure storage employs Reed- 
Solomon codes for similar reasons [10]. The rotated RS 
codes that we present inherit all the properties of Reed- 
Solomon codes and improve degraded reads. 


Recovery Optimization: Workload-based approaches 
to improving recovery are independent of the choice of 
erasure code and apply to the minimum I/O recovery algorithm
and the rotated RS codes that we present. These in-
clude: load-balancing recovery among disks [22], recov- 
ering popular data first to decrease read degradation [48], 
and only recovering blocks that contain live data [45]. 
Similarly, architecture support for recovery can be ap- 
plied to our codes, such as hardware that minimizes data 
copying [13] and parity declustering [21]. 

Reducing the amount of data used in recovery has only 
emerged recently as a topic and the first results have 
given minimum recovery schedules for EVENODD [49] 
and row-diagonal parity [51], both RAID-6 codes. We 
present an algorithm that defines the recovery I/O lower 
bound for any XOR-based erasure code and allows mul- 
tiple codes to be compared for I/O recovery cost. 

Regenerating codes provide optimal recovery band- 
width [12] among storage nodes. This concept is differ- 
ent than minimizing I/O; each storage node reads all of 
its available data and computes and sends a linear combi- 
nation. Regenerating codes were designed for distributed 
systems in which wide-area bandwidth limits recovery 
performance. Exact regenerating codes [39] recover lost 
data exactly (not a new linear combination of data). In 
addition to minimizing recovery bandwidth, these codes 
can in some cases reduce recovery I/O. The relationship 
between recovery bandwidth and recovery data size re- 
mains an open problem. 

RAID systems suffer reduced performance during 
recovery because the recovery process interferes with 
workload. Tian et al. [48] reorder recovery so that fre- 
quently read data are rebuilt first. This minimizes the 
number of reads in degraded mode. Jin et al. [25] pro- 
pose reconfiguring an array from RAID-5 to RAID-0 
during recovery so that reads to strips of data that are 
not on the failed disk do not need to be recovered. Our 
treatment differs in that we separate degraded reads from 
recovery; we make degraded reads more efficient by re- 
building just the requested data, not the entire stripe. 
Figure 1: One stripe from an erasure coded storage sys-
tem. The parameters are k = 6, m = 3 and r = 4.


3 Background: Erasure Coded Storage 


Erasure coded storage systems add redundancy for fault-
tolerance. Specifically, a system of n disks is partitioned
into k disks that hold data and m disks that hold coding
information. The coding information is calculated from
the data using an erasure code. For practical storage sys-
tems, the erasure code typically has two properties. First,
it must be Maximum Distance Separable (MDS), which
means that if any m of the n disks fail, their contents
may be recomputed from the k surviving disks. Second,
it must be systematic, which means that the k data disks
hold unencoded data.
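
The two properties can be seen in the simplest possible case. The sketch below uses single XOR parity (m = 1), which is swapped in here purely for illustration and is not one of the codes studied in this paper: the data disks are stored unencoded (systematic), and any one lost disk can be rebuilt from the k survivors (MDS for m = 1). Function names are hypothetical.

```python
from functools import reduce


def xor_disks(disks: list[bytes]) -> bytes:
    """Bytewise XOR of equally sized disk contents."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def encode_single_parity(data_disks: list[bytes]) -> bytes:
    """Compute the one coding disk; the data disks themselves are unchanged."""
    return xor_disks(data_disks)


def recover_missing(surviving_disks: list[bytes]) -> bytes:
    """Rebuild the single failed disk from the k surviving disks."""
    return xor_disks(surviving_disks)


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # k = 3 data disks
parity = encode_single_parity(data)             # m = 1 coding disk

# Lose data disk 1 and rebuild it from the other data disks plus parity.
rebuilt = recover_missing([data[0], data[2], parity])
print(rebuilt == data[1])
```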

An erasure coded storage system is partitioned into 
stripes, which are collections of disk blocks from each of 
the n disks. The blocks themselves are partitioned into 
symbols, and there is a fixed number of symbols for each 
disk in each stripe. We denote this quantity r. The stripes 
perform encoding and decoding as independent units in 
the disk system. Therefore, to alleviate hot spots that can 
occur because the coding disks may require more activ- 
ity than the data disks, one can rotate the disks’ identities 
on a stripe-by-stripe basis. 

For the purpose of our analysis, we focus on a sin-
gle stripe. There are k data disks labeled $D_0, \ldots, D_{k-1}$
and m coding disks labeled $C_0, \ldots, C_{m-1}$. There are nr
symbols in the stripe. We label the r symbols on data
disk i as $d_{i,0}, d_{i,1}, \ldots, d_{i,r-1}$ and on coding disk j
as $c_{j,0}, c_{j,1}, \ldots, c_{j,r-1}$. We depict an example system
in Figure 1. In this example, k = 6, m = 3 (and there-
fore n = 9) and r = 4.

Erasure codes are typically defined so that each sym-
bol is a w-bit word, where w is typically small, often
one. Then the coding words are defined as computations
of the data words. Thus for example, suppose an era-
sure code were defined in Figure 1 for w = 1. Then
each symbol in the stripe would be composed of one sin-
gle bit. While that eases the definition of the erasure
code, it does not map directly to a disk system. In re-
ality, it makes sense for each symbol in a sealed block
to be much larger in size, on the order of kilobytes or
megabytes, and for each symbol to be partitioned into w-
bit words, which are encoded and decoded in parallel.
Figure 2 depicts such a partitioning, where each symbol
is composed of multiple words. When w = 1, this parti-
tioning is especially efficient, because machines support
bit operations like exclusive-or (XOR) over 64-bit and
even 128-bit words, which in effect perform 64 or 128
XOR operations on 1-bit words in parallel.

Figure 2: Relationship between words, symbols and
sealed blocks.
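The word-parallel XOR described above can be illustrated with a minimal Python sketch. The 64-bit width, the helper names `pack` and `xor_bitwise`, and the random test vectors are assumptions made for this illustration only.

```python
import random

WORD_BITS = 64  # assumed machine word width


def xor_bitwise(a_bits, b_bits):
    """XOR two lists of 1-bit words, one bit at a time."""
    return [x ^ y for x, y in zip(a_bits, b_bits)]


def pack(bits):
    """Pack a list of bits (most significant first) into one integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


random.seed(0)
a = [random.randint(0, 1) for _ in range(WORD_BITS)]
b = [random.randint(0, 1) for _ in range(WORD_BITS)]

# One integer XOR yields the same result as 64 separate 1-bit XORs.
assert pack(a) ^ pack(b) == pack(xor_bitwise(a, b))
```

The single integer XOR stands in for one machine instruction; a real implementation would operate on buffers of such words.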

When w = 1, the arithmetic is modulo 2: addition
is XOR, and multiplication is AND. When w > 1,
the arithmetic employed is Galois Field arithmetic, de-
noted $GF(2^w)$. In $GF(2^w)$, addition is still XOR; how-
ever, multiplication is more complex, requiring a variety
of implementation techniques that depend on hardware,
memory, co-processing elements and w [17].
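To make the contrast between addition and multiplication concrete, here is a hedged sketch of one simple multiplication technique (shift-and-add with polynomial reduction). The function name `gf_mul` and the default reduction polynomial 0x11D for w = 8 are assumptions for illustration; the text does not say which technique or polynomial any particular library uses.

```python
def gf_mul(a, b, w=8, poly=0x11D):
    """Multiply a and b in GF(2^w) by shift-and-add.

    `poly` is an assumed primitive polynomial including its x^w term
    (0x11D is x^8 + x^4 + x^3 + x^2 + 1).
    """
    result = 0
    while b:
        if b & 1:           # add (XOR) the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1             # multiply a by x
        if a & (1 << w):    # reduce when the degree reaches w
            a ^= poly
    return result


# Addition in GF(2^w) is plain XOR.
assert 5 ^ 3 == 6
# (x + 1)(x^2 + x + 1) = x^3 + 1, so 3 * 7 = 9 with no reduction needed.
assert gf_mul(3, 7) == 9
```

Table-driven or hardware-assisted methods are usually faster; this loop is only meant to show why multiplication costs more than the single XOR that addition needs.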


3.1 Matrix-Vector Definition 


All erasure codes may be expressed in terms of a matrix-
vector product. An example is pictured in Figure 3. This
continues the example from Figure 1, where k = 6,
m = 3 and r = 4. In this picture, the erasure code is de-
fined precisely. This is a Cauchy Reed-Solomon code [6]
optimized by the Jerasure library [38]. The word size, w,
equals one, so all symbols are treated as bits and arith-
metic is composed solely of the XOR operation. The kr
symbols of data are organized as a kr-element bit vector.
They are multiplied by an nr × kr Generator matrix $G^T$.¹
The product is a vector, called the codeword, with nr el-
ements. These are all of the symbols in the stripe. Each
collection of r symbols in the vector is stored on a differ-
ent disk in the system.

Since the top kr rows of $G^T$ compose an identity
matrix, the first kr symbols in the codeword contain the
data. The remaining mr symbols are calculated from the
data using the bottom mr rows of the Generator matrix.

¹The archetypical presentation of erasure codes [26, 29, 32] typi-
cally uses the transpose of this matrix; hence, we call this matrix $G^T$.

When up to m disks fail, the standard methodology for
recovery is to select k surviving disks and create a de-
coding matrix B from the kr rows of the Generator ma-
trix that correspond to them. The product of $B^{-1}$ and
the symbols in the k surviving disks yields the original
data [6, 20, 33].
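The encode and decode steps above can be sketched for w = 1 as follows. The toy generator matrix `G_T` (k = m = r = 2), the data vector, and the helper names are hypothetical and chosen only to keep the example small; this is not the code of Figure 3, and instead of forming $B^{-1}$ explicitly the sketch solves the equivalent linear system by Gauss-Jordan elimination.

```python
def mat_vec(matrix, vector):
    """Matrix-vector product over GF(2): AND for multiply, XOR for add."""
    return [sum(m & v for m, v in zip(row, vector)) % 2 for row in matrix]


def solve_gf2(B, y):
    """Solve B x = y over GF(2); B must be square and invertible."""
    n = len(B)
    aug = [row[:] + [rhs] for row, rhs in zip(B, y)]
    for col in range(n):
        # next() raises StopIteration if B is singular.
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [p ^ q for p, q in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


# Hypothetical toy code: nr = 8 rows, kr = 4 columns.
G_T = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],  # identity
    [1, 0, 1, 0], [0, 1, 0, 1],                              # coding disk 0 (P)
    [1, 0, 0, 1], [0, 1, 1, 0],                              # coding disk 1
]
data = [1, 0, 1, 1]
codeword = mat_vec(G_T, data)

# Data disk 0 (symbols 0 and 1) fails; decode from data disk 1 and the P disk.
surviving = [2, 3, 4, 5]
B = [G_T[i] for i in surviving]
recovered = solve_gf2(B, [codeword[i] for i in surviving])
assert recovered == data
```

The surviving rows were picked so that B is invertible; a code that is not MDS, like this toy one, does not guarantee that for every choice of k disks.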

There are many MDS erasure codes that apply to
storage systems. Reed-Solomon codes [40] are de-
fined for all values of k and m. With a Reed-Solomon
code, r = 1, and w must be such that $2^w \ge n$. Gener-
ator matrices are constructed from a Vandermonde ma-
trix so that any k × k subset of the Generator matrix
is invertible. There is quite a bit of reference material
on Reed-Solomon codes as they apply to storage sys-
tems [33, 36, 6, 41], plus numerous open-source Reed-
Solomon coding libraries [42, 38, 30, 31].

Cauchy Reed-Solomon codes convert Reed-Solomon 
codes with r = 1 and w > 1 to a code where r = w 
and w = 1. In doing so, they remove the expensive 
multiplication of Galois Fields and replace it with addi- 
tional XOR operations. There are an exponential number 
of ways to construct the Generator matrix of a Cauchy 
Reed-Solomon code. The Jerasure library attempts to 
construct a matrix with a minimal number of non-zero 
entries [38]. It is these matrices that we use in our exam- 
ples with Cauchy Reed-Solomon codes. 

For m = 2, otherwise known as RAID-6, there
has been quite a bit of research on constructing codes
where w = 1 and the CPU performance is optimized.
EVENODD [3], RDP [11] and Blaum-Roth [5] codes all
require r + 1 to be a prime number such that $k \le r + 1$
(EVENODD) or $k \le r$. The Liberation codes [34]
require r to be a prime number and $k \le r$, and the
Liber8tion code [35] is defined for r = 8 and $k \le r$.
The latter three codes (Blaum-Roth, Liberation and
Liber8tion) belong to a family of codes called Minimum
Density codes, whose Generator matrices have a prov-
ably minimum number of ones.

Both EVENODD and RDP codes have been extrapo-
lated to higher values of m [2, 4]. We call these Gen-
eralized EVENODD and RDP. With m = 3, the same
restrictions on r apply. For larger values of m, there are
additional restrictions on r. The STAR code [24] is an
instance of the generalized EVENODD code for m = 3,
where recovery is performed without using the Generator
matrix.

All of the above codes have a convenient feature that
disk $C_0$ is constructed as the parity of the data disks, as in
RAID-4/5. Thus, the r rows of the Generator matrix im-
mediately below the identity portion are composed of k
(r × r) identity matrices. To be consistent with these
RAID systems, we will refer to disk $C_0$ as the “P drive.”

Figure 3: The matrix-vector representation of an erasure code. The parameters are the same as Figure 1: k = 6, m = 3
and r = 4. Symbols are one bit (i.e. w = 1). This is a Cauchy Reed-Solomon code for these parameters.


4 Optimal Recovery of XOR-Based Erasure Codes


When a data disk fails in an erasure coded disk array, it is
natural to reconstruct it simply using the P drive. Each
failed symbol is equal to the XOR of corresponding sym-
bols on each of the other data disks, and the parity sym-
bol on the P disk. We call this methodology “Reading
from the P drive.” It requires k symbols to be read from
disk for each decoded symbol.
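A minimal sketch of "Reading from the P drive", assuming symbols are represented as Python integers and disks as equal-length lists; the function names and sample values are hypothetical.

```python
from functools import reduce
from operator import xor


def p_drive(data_disks):
    """Row-wise parity across the data disks: one parity symbol per row."""
    return [reduce(xor, row) for row in zip(*data_disks)]


def recover_from_p(data_disks, parity, failed):
    """Rebuild data disk `failed`; return (symbols, number of symbols read)."""
    survivors = [d for i, d in enumerate(data_disks) if i != failed]
    rebuilt = [reduce(xor, row) for row in zip(parity, *survivors)]
    # Each decoded symbol needs k - 1 data symbols plus one parity symbol.
    reads = len(rebuilt) * (len(survivors) + 1)
    return rebuilt, reads


disks = [[3, 5], [7, 1], [2, 9]]      # k = 3 data disks, r = 2 symbols each
parity = p_drive(disks)
rebuilt, reads = recover_from_p(disks, parity, failed=1)
assert rebuilt == disks[1]
assert reads == 3 * 2                 # k symbols read per decoded symbol
```

The read count is always kr for a whole disk, which is the baseline the rest of the section tries to improve on.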

Although it is straightforward both in concept and im- 
plementation, in many cases, reading from the P drive 
requires more I/O than is necessary. In particular, de- 
pending on the erasure code, there are savings that can 
be exploited when multiple symbols are recovered in the 
same stripe. This effect was first demonstrated by Xiang 
et al. in RDP systems in which one may reconstruct all 
the failed blocks in a stripe by reading 25 percent fewer 
symbols than reading from the P drive [51]. In this sec- 
tion, we approach the problem in general. 


4.1 Algorithm to Determine the Minimum 
Number of Symbols for Recovery 


We present an algorithm for recovering from a single 
disk failure in any XOR-based erasure code with a mini- 
mum number of symbols. The algorithm takes as input a 
Generator matrix whose symbols are single bits and the 
identity of a failed disk and outputs equations to decode 
each failed symbol. The inputs to the equations are the 
symbols that must be read from disk. The number of in- 
puts is minimized. 

The algorithm is computationally expensive: for the
systems evaluated for this paper, each instantiation took
from seconds to hours of compute-time. However, for
any realistic storage system, the number of recovery sce-
narios is limited, so that the algorithm may be run ahead
of time, and the results may be stored for when they are
required by the system.

We explain the algorithm by using the erasure code of
Figure 4 as an example. This small code, with k = m =
r = 2, is not an MDS code; however, its simplicity facil-
itates our explanation. We label the rows of $G^T$ as $R_i$,
$0 \le i < nr$. Each row $R_i$ corresponds to a data or coding
symbol, and to simplify our presentation, we will refer to
symbols using $R_i$ rather than $d_{i,j}$ or $c_{i,j}$. Consider a set
of symbols in the codeword whose corresponding rows
in the Generator matrix sum to a vector of zeroes. One
example is $\{R_0, R_2, R_4\}$. We call such a set of symbols
a decoding equation, because the fact that their rows sum to
zero allows us to decode any one symbol in the set as
long as the remaining symbols are not lost.

Suppose that we enumerate all decoding equations for
a given Generator matrix, and suppose that some sub-
set F of the codeword symbols are lost. For each sym-
bol $R_i \in F$, we can determine the set $E_i$ of decod-
ing equations for $R_i$. Formally, an equation $e_i \in E_i$ if
$e_i \cap F = \{R_i\}$. For example, the equation represented by
the set $\{R_0, R_2, R_4\}$ may be a decoding equation in $E_2$
so long as neither $R_0$ nor $R_4$ is in F.

Figure 4: An example erasure code to explain the algo-
rithm to minimize the number of symbols required to re-
cover from failures.

We can recover all the symbols in F by selecting one
decoding equation $e_i$ from each set $E_i$, reading the non-
failed symbols in $e_i$ and then XOR-ing them to produce
the failed symbol. To minimize the number of symbols
read, our goal is to select one equation $e_i$ from each $E_i$
such that the number of symbols in the union of all $e_i$ is
minimized.
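The definitions of decoding equations and of the sets $E_i$ can be made concrete with a brute-force sketch. Figure 4 itself is not reproduced in this text, so the rows in `ROWS` are an assumed generator matrix, reconstructed to be consistent with the decoding equations that the section lists for this example; the function names are hypothetical.

```python
from itertools import combinations

# Assumed rows R_0..R_7 for k = m = r = 2 (4-bit rows, one per symbol).
ROWS = [0b1000, 0b0100, 0b0010, 0b0001,   # data symbols
        0b1010, 0b0101,                   # P drive
        0b1001, 0b0110]                   # second coding disk


def all_decoding_equations(rows):
    """Every non-empty set of row indices whose rows XOR to zero."""
    found = []
    for size in range(1, len(rows) + 1):
        for combo in combinations(range(len(rows)), size):
            total = 0
            for i in combo:
                total ^= rows[i]
            if total == 0:
                found.append(frozenset(combo))
    return found


def equations_for(symbol, failed, equations):
    """E_i: the equations e with e ∩ F = {R_i}."""
    return [e for e in equations if e & failed == {symbol}]


def as_bits(equation, n):
    """nr-element bit string with position i set when R_i is in the set."""
    return "".join("1" if i in equation else "0" for i in range(n))


equations = all_decoding_equations(ROWS)
E0 = equations_for(0, {0, 1}, equations)
print(sorted(as_bits(e, len(ROWS)) for e in E0))
```

Exhaustive enumeration is exponential in nr, which is acceptable only for tiny examples like this one.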

For example, suppose that a disk fails, and both $R_0$
and $R_1$ are lost. A standard way to decode the failed
bits is to read from the P drive and use coding sym-
bols $R_4$ and $R_5$. In equation form, $F = \{R_0, R_1\}$,
$e_0 = \{R_0, R_2, R_4\}$ and $e_1 = \{R_1, R_3, R_5\}$. Since $e_0$
and $e_1$ have distinct symbols, their union is composed of
six symbols, which means that four must be read for re-
covery. However, if we instead use $\{R_1, R_2, R_7\}$ for $e_1$,
then $(e_0 \cup e_1)$ has five symbols, meaning that only three
are required for recovery.

Thus, our problem is as follows: Given |F| sets of
decoding equations $E_0, E_1, \ldots, E_{|F|-1}$, we wish to se-
lect one equation from each set such that the size of the
union of these equations is minimized. Unfortunately,
this problem is NP-Hard in |F| and $|E_i|$.² However, we
can solve the problem for practical values of |F| and $|E_i|$
(typically less than 8 and 25 respectively) by converting
the equations into a directed, weighted graph and finding
the shortest path through the graph. Given an instance of
the problem, we convert it to a graph as follows. First, we
represent each decoding equation in set form as an nr-
element bit string. For example, $\{R_0, R_2, R_4\}$ is repre-
sented by 10101000.

Each node in the graph is also represented by an nr-
element bit string. There is a starting node Z whose
string is all zeroes. The remaining nodes are partitioned
into |F| sets, labeled $S_0, S_1, \ldots, S_{|F|-1}$. For each equa-
tion $e_0 \in E_0$, there is a node $s_0 \in S_0$ whose bit string
equals $e_0$'s bit string. There is an edge from Z to each $s_0$
whose weight is equal to the number of ones in $s_0$'s bit
string.

For each node $s_i \in S_i$, there is an edge that cor-
responds to each $e_{i+1} \in E_{i+1}$. This edge is to a
node $s_{i+1} \in S_{i+1}$ whose bit string is equal to the bitwise
OR of the bit strings of $s_i$ and $e_{i+1}$. The OR calculates
the union of the equations leading up to $s_i$ and $e_{i+1}$. The
weight of the edge is equal to the difference between the
number of ones in the bit strings of $s_i$ and $s_{i+1}$. The
shortest path from Z to any node in $S_{|F|-1}$ denotes the
minimum number of elements required for recovery. If
we annotate each edge with the decoding equation that
creates it, then the shortest path contains the equations
that are used for recovery.

To illustrate, suppose again that $F = \{R_0, R_1\}$, mean-
ing $f_0 = R_0$ and $f_1 = R_1$. The decoding equations
for $E_0$ and $E_1$ are denoted by $e_{i,j}$ where i is the index of
the lost symbol in the set F and j is an index into the set
$E_i$. $E_0$ and $E_1$ are enumerated below:

$E_0$                      $E_1$
$e_{0,0}$ = 10101000       $e_{1,0}$ = 01010100
$e_{0,1}$ = 10010010       $e_{1,1}$ = 01101110
$e_{0,2}$ = 10011101       $e_{1,2}$ = 01100001
$e_{0,3}$ = 10100111       $e_{1,3}$ = 01011011

²Reduction from Vertex Cover.


These equations may be converted to the graph de-
picted in Figure 5, which has two shortest paths of length
five: $\{e_{0,0}, e_{1,2}\}$ and $\{e_{0,1}, e_{1,0}\}$. Both require three
symbols for recovery: $\{R_2, R_4, R_7\}$ and $\{R_3, R_5, R_6\}$.
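The graph construction and shortest-path search can be sketched in Python as follows. The bit strings are the ones enumerated in the text for this example; the function name `shortest_path`, the node encoding as (level, bit mask) pairs, and the lazy node generation are choices made for this illustration and are not taken from the authors' implementation.

```python
import heapq

E0 = ["10101000", "10010010", "10011101", "10100111"]
E1 = ["01010100", "01101110", "01100001", "01011011"]


def shortest_path(equation_sets, start=None):
    """Dijkstra over nodes (level, union of chosen equations).

    Returns (path length, list of chosen equations as bit strings).
    `start` is the bit string of node Z (all zeroes by default).
    """
    width = len(equation_sets[0][0])
    sets = [[int(e, 2) for e in eqs] for eqs in equation_sets]
    z = int(start, 2) if start else 0
    heap = [(0, 0, z, ())]
    settled = set()
    while heap:
        cost, level, mask, path = heapq.heappop(heap)
        if level == len(sets):
            return cost, [format(e, "0{}b".format(width)) for e in path]
        if (level, mask) in settled:
            continue
        settled.add((level, mask))
        for eq in sets[level]:
            union = mask | eq                      # bitwise OR = set union
            weight = bin(union).count("1") - bin(mask).count("1")
            heapq.heappush(heap, (cost + weight, level + 1, union, path + (eq,)))
    raise ValueError("an equation set is empty")


cost, path = shortest_path([E0, E1])
assert cost == 5   # the path length reported in the text
```

Because nodes are created only when an edge is relaxed, the parts of the graph that cost more than the best path found so far are never expanded; ties between the two length-five paths are broken arbitrarily by the heap ordering.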

While the graph clearly contains an exponential num-
ber of nodes, one may program Dijkstra’s algorithm to
determine the shortest path and prune the graph drasti-
cally. For example, in Figure 5, the shortest path will be
discovered before the dotted edges and grayed nodes
are considered by the algorithm. Therefore, they may be
pruned.

Figure 5: The graph that results when $R_0$ and $R_1$ are lost.


4.2 Algorithm for Reconstruction


When data disk i fails, the algorithm is applied for $F =
\{d_{i,0}, \ldots, d_{i,r-1}\}$. When coding disk j fails, $F =
\{c_{j,0}, \ldots, c_{j,r-1}\}$. If a storage system rotates the iden-
tities of the disks on a stripe-by-stripe basis, then the av-
erage number of symbols for all failed disks multiplied
by the total number of stripes gives a measure of the sym-
bols required to reconstruct a failed disk.


4.3 Algorithm for Degraded Reads 


To take maximum advantage of parallel I/O, we assume
that contiguous symbols in the file system are stored on
different disks in the storage system. In other words, if
one is reading three symbols starting with symbol $d_{0,0}$,
then those three symbols are $d_{0,0}$, $d_{1,0}$ and $d_{2,0}$, coming
from three different disk drives.

To evaluate degraded reads, we assume that an appli-
cation desires to read B symbols starting at symbol $d_{x,y}$,
and that data disk f has failed. We determine the penalty
of the failure to be the number of symbols required to
perform the read, minus B.

There are many cases that can arise from the differ- 
ing values of B, f, x and y. To illustrate, first suppose 
that B < k (which is a partial read case) and that none of 
the symbols to be read reside on disk f. Then the failure 
does not impact the read operation — it takes exactly B 
symbols to complete the read, and the penalty is zero. 

As a second case, consider when B = kr and $d_{x,y} =
d_{0,0}$. Then we are reading exactly one stripe in its en-
tirety. In this case, we have to read the (k−1)r non-failed
data symbols to fulfill the read request. Therefore, we
may recover very easily from the P drive by reading all
of its symbols and decoding. The read requires kr = B
symbols. Once again, the penalty is zero.

However, consider the case when B = k, f = 0, and
$d_{x,y} = d_{1,0}$. Symbols $d_{1,0}$ through $d_{k-1,0}$ are non-failed
and must be read. Symbol $d_{0,1}$ must also be read and it
is failed. If we use the P drive to recover, then we need
to read $d_{1,1}$ through $d_{k-1,1}$ and $c_{0,1}$. The total symbols
read is 2k − 1: the failure has induced a penalty of k − 1
symbols.

In all of these cases, the degraded read is contained 
by one stripe. If the read spans two stripes, then we 
can calculate the penalty as the sum of the penalties of 
the read in each stripe. If the read spans more than two 
stripes, then we only need to calculate the penalties in the 
first and last stripe. This is because, as described above, 
whole-stripe degraded reads incur no penalty. 

When we perform a degraded read within a stripe, we
modify our algorithm slightly. For each non-failed data
symbol that must be read, we set its bit in the state of the
starting node Z to one. For example, in Figure 4, sup-
pose we are performing a degraded read where B = 2,
f = 0 and $d_{x,y} = d_{0,0}$. There is one failed bit: $F = \{d_{0,0}\}$.
Since $d_{1,0} = R_2$ must be read, the starting state Z of the
shortest path graph is labeled 00100000. The algorithm
correctly identifies that only $c_{0,0}$ needs to be read to re-
cover $d_{0,0}$ and complete the read.
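With a single failed symbol the graph has only one level, so the modified starting state can be checked with a few lines of Python. The equation strings are the $E_0$ set listed earlier for this example code; the names `edge_weight`, `best` and `extra_reads` are hypothetical.

```python
E0 = ["10101000", "10010010", "10011101", "10100111"]
Z = "00100000"   # d_{1,0} = R_2 is already part of the requested read
FAILED = 0       # index of the lost symbol R_0 = d_{0,0}


def edge_weight(state, equation):
    """Number of ones that the equation adds to the state."""
    return sum(1 for s, e in zip(state, equation) if e == "1" and s == "0")


# With |F| = 1 the shortest path is simply the cheapest edge out of Z.
best = min(E0, key=lambda eq: edge_weight(Z, eq))

# Symbols that must be read in addition to the requested, non-failed ones.
extra_reads = [i for i, (s, e) in enumerate(zip(Z, best))
               if e == "1" and s == "0" and i != FAILED]
print(best, extra_reads)
```

Index 4 corresponds to $R_4$, the first symbol of the P drive, which is the $c_{0,0}$ named in the text.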


5 Rotated Reed-Solomon Codes 


Before performing analyses of failed disk reconstruction
and degraded reads, we present two instances of a new
erasure code, called the Rotated Reed-Solomon code.
These codes have been designed to be MDS codes that
optimize the performance of degraded reads for single
disk failures. The general formulation and theoretical
evaluation of these codes is beyond the scope of this pa-
per; instead, we present instances for $m \in \{2, 3\}$.

Figure 6: A Reed-Solomon code for k = 6 and m =
3. Symbols must be w-bit words such that $w \ge 4$, and
arithmetic is over $GF(2^w)$.


The most intuitive way to present a Rotated Reed-
Solomon code is as a modification to a standard Reed-
Solomon code. We present such a code for $m \le 3$ in
Equation 1. As with all Reed-Solomon codes, r = 1.

$$\text{for } 0 \le j < 3, \quad c_{j,0} = \sum_{i=0}^{k-1} (2^j)^i \, d_{i,0} \qquad (1)$$


This is an MDS code so long as k, m, r and w adhere
to some constraints, which we detail at the end of this
section. This code is attractive because one may imple-
ment encoding with XOR and multiplication by two and
four in $GF(2^w)$, which are all very fast operations. For
example, the m = 2 version of this code lies at the heart
of the Linux RAID-6 coding engine [1].

We present the code pictorially in Figure 6. A chain
of circles denotes taking the XOR of $d_{i,0}$; a chain of tri-
angles denotes taking the XOR of $2^i d_{i,0}$, and a chain of
squares denotes taking the XOR of $4^i d_{i,0}$. To convert this
code into a Rotated Reed-Solomon code, we allow r to
take on any positive value, and define the coding symbols
with Equation 2.

$$c_{j,b} = \sum_{i=0}^{\frac{kj}{m}-1} (2^j)^i \, d_{i,(b+1) \bmod r} + \sum_{i=\frac{kj}{m}}^{k-1} (2^j)^i \, d_{i,b} \qquad (2)$$

Intuitively, the Rotated Reed-Solomon code converts
the one-row code in Figure 6 into a multi-row code,
and then the equations for coding disks 1 and 2 are
split across adjacent rows. We draw the Rotated Reed-
Solomon codes for k = 6 and $m \in \{2, 3\}$ and r = 3 in
Figures 7 and 8.

These codes have been designed to improve the
penalty of degraded reads. Consider a RAID-6 system
that performs a degraded read of four symbols starting
at $d_{5,0}$ when disk 5 has failed. If we reconstruct from
the P drive, we need to read $d_{0,0}$ through $d_{4,0}$ plus $c_{0,0}$
to reconstruct $d_{5,0}$. Then we read the non-failed sym-
bols $d_{0,1}$, $d_{1,1}$ and $d_{2,1}$. The penalty is 5 symbols. With
Rotated Reed-Solomon coding, $d_{5,0}$, $d_{0,1}$, $d_{1,1}$ and $d_{2,1}$
all participate in the equation for $c_{1,0}$. Therefore, by
reading $c_{1,0}$, $d_{0,1}$, $d_{1,1}$, $d_{2,1}$, $d_{3,0}$ and $d_{4,0}$, one both de-
codes $d_{5,0}$ and reads the symbols that were required to
be read. The penalty is only two symbols.

Figure 7: A Rotated Reed-Solomon code for k = 6, m =
2 and r = 3.

Figure 8: A Rotated Reed-Solomon code for k = 6, m =
3 and r = 3.
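The degraded-read example for the RAID-6 case can be checked with a short sketch that lists which symbols take part in a coding equation and counts the reads. The tuple encoding of symbols, the function names, and the use of integer division for kj/m are assumptions made for this illustration (kj/m is an integer for the k = 6 configurations discussed here); the GF coefficients are omitted because only the set of symbols read matters for the penalty.

```python
def equation_symbols(j, b, k, m, r):
    """Symbols taking part in coding symbol c_{j,b} of a Rotated RS code.

    Data symbols are ("d", disk, row); the coding symbol is ("c", j, b).
    """
    split = (k * j) // m   # assumed integer; disks below it use the next row
    data = {("d", i, (b + 1) % r if i < split else b) for i in range(k)}
    return data | {("c", j, b)}


def degraded_read_penalty(requested, failed_disk, equation):
    """Symbols read minus symbols requested, decoding with one equation."""
    def lost(s):
        return s[0] == "d" and s[1] == failed_disk

    reads = {s for s in requested if not lost(s)}
    reads |= {s for s in equation if not lost(s)}
    return len(reads) - len(requested)


k, m, r = 6, 2, 3
# Four contiguous symbols starting at d_{5,0}; disk 5 has failed.
requested = [("d", 5, 0), ("d", 0, 1), ("d", 1, 1), ("d", 2, 1)]

p_penalty = degraded_read_penalty(requested, 5, equation_symbols(0, 0, k, m, r))
rot_penalty = degraded_read_penalty(requested, 5, equation_symbols(1, 0, k, m, r))
print(p_penalty, rot_penalty)
```

The helper handles only the case of one lost symbol decoded with one chosen equation, which is all this example needs.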


With whole disk reconstruction, when r is an even
number, one can reconstruct any failed data disk by read-
ing $\frac{r}{2}\left(k + \left\lceil \frac{k}{m} \right\rceil\right)$ symbols. The process is exemplified
for k = 6, m = 3 and r = 4 in Figure 9. The first data
disk has failed, and the symbols required to reconstruct
each of the failed symbols are darkened and annotated
with the equation that is used for reconstruction. Each
pair of reconstructed symbols in this example shares four
data symbols for reconstruction. Thus, the whole recon-
struction process requires a total of 16 symbols, as op-
posed to 24 when reading from the P Drive.
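As a quick arithmetic check of the symbol count for whole-disk reconstruction, the following sketch evaluates the expression for the example parameters; the function name is hypothetical.

```python
def rotated_reconstruction_reads(k, m, r):
    """Symbols read to rebuild a data disk: (r / 2) * (k + ceil(k / m))."""
    if r % 2:
        raise ValueError("the expression is stated for even r only")
    ceil_k_over_m = -(-k // m)   # integer ceiling division
    return (r // 2) * (k + ceil_k_over_m)


k, m, r = 6, 3, 4
rotated = rotated_reconstruction_reads(k, m, r)
p_drive = k * r                  # reading from the P drive
print(rotated, p_drive)
```

For these parameters the two values are 16 and 24, so the rotated code reads two thirds of the symbols that P-drive recovery does.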

The process is similar for the other data drives. Re-
constructing failed coding drives, however, does not have
the same benefits. We are unaware at present of how
to reconstruct a coding drive with fewer than the maxi-
mum kr symbols.

Figure 9: Reconstructing disk 0 when it fails, using Ro-
tated Reed-Solomon coding for k = 6, m = 3, r = 4.

As an aside, when more than one disk fails, Rotated
Reed-Solomon codes may require much more computa-
tion to recover than other codes, due to the use of matrix
inversion for recovery. We view this property as less im-
portant, since multiple disk failures are rare occurrences
in practical storage systems, and computational overhead
is less important than the I/O impact of recovery.


5.1 MDS Constraints


The Rotated Reed-Solomon code specified above in Sec-
tion 5 is not MDS in general. In other words, there are
settings of k, m, w and r which cannot tolerate the fail-
ure of any m disks. Below, we detail ways to constrain
these variables so that the Rotated Reed-Solomon code
is MDS. Each of these settings has been verified by test-
ing all combinations of m failures to make sure that they
may be tolerated. They cover a wide variety of system
sizes, certainly much larger than those in use today.
The constraints are as follows:

$m \in \{2, 3\}$
$k \le 36$, and $k + m \le 2^w + 1$
$w \in \{4, 8, 16\}$
$r \in \{2, 4, 8, 16, 32\}$


Moreover, when w = 16, r may be any value less 
than or equal to 48, except 15, 30 and 45. It is a matter of 
future research to derive general-purpose MDS construc- 
tions of Rotated Reed-Solomon codes. 
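The listed constraints can be written as a small predicate. This is a sketch that assumes the inequalities are non-strict ($k \le 36$ and $k + m \le 2^w + 1$); the function name is hypothetical, and a False result only means the setting is outside the verified list, not that the code is known to fail.

```python
def in_verified_mds_settings(k, m, w, r):
    """True if (k, m, w, r) falls inside the settings listed as verified MDS."""
    if m not in (2, 3) or w not in (4, 8, 16):
        return False
    if not (1 <= k <= 36 and k + m <= 2 ** w + 1):
        return False
    if r in (2, 4, 8, 16, 32):
        return True
    # Extra allowance stated for w = 16: any r up to 48 except 15, 30, 45.
    return w == 16 and 1 <= r <= 48 and r not in (15, 30, 45)


print(in_verified_mds_settings(6, 3, 8, 4))    # configuration used in Figure 9
```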


6 Analysis of Reconstruction 


We evaluate the minimum number of symbols required to
recover a failed disk in erasure coding systems with a va-
riety of erasure codes. We restrict our attention to MDS
codes, and systems with six data disks and either two or
three coding disks. We summarize the erasure codes that
we test in Table 1. For each code, if r has restrictions
based on k and m, we denote it in the table and include
the actual values tested in the last column. All codes,
with the exception of Rotated Reed-Solomon codes, are
XOR codes, and all without exception define the P drive
identically. Since there are a variety of Cauchy Reed-
Solomon codes that can be generated for any value of k,
m and r, we use the codes generated by the Jerasure cod-
ing library, which attempts to minimize the number of
non-zero bits in the Generator matrix [38].

Figure 10: The minimum number of symbols required to
reconstruct a failed disk in a storage system when k = 6
and $m \in \{2, 3\}$.







Code                 | m    | Restrictions on r | r tested
EVENODD [3]          | 2    | r + 1 prime > k   | 6
RDP [11]             | 2    | r + 1 prime > k   | 6
Blaum-Roth [5]       | 2    | r + 1 prime > k   | 6
Liberation [34]      | 2    | r prime > k       | 7
Liber8tion [35]      | 2    | r = 8, r ≥ k      | 8
STAR [24]            | 3    | r + 1 prime > k   | 6
Generalized RDP [2]  | 3    | r + 1 prime > k   | 6
Cauchy RS [6]        | 2, 3 | 2^r ≥ n           | 3-8
Rotated              | 2, 3 | None              | 6


Table 1: The erasure codes and values of r tested.


For each code listed in Table 1, we ran the algorithm
from section 4.1 to determine the minimum number of
symbols required to reconstruct each of the k + m failed
disks in one stripe. The average number is plotted in
Figure 10. The Y-axis of these graphs is expressed
as a percentage of kr, which represents the number of
symbols required to reconstruct from the P drive. This
is also the number of symbols required when standard
Reed-Solomon coding is employed.

In both sides of the figure, the codes are ordered from
best to worst, and two bars are plotted: the average num-
ber of symbols required when the failed disk is a data
disk, and when the failed disk can be either data or cod-
ing. In all codes, the performance of decoding data disks
is better than re-encoding coding disks. As mentioned in
Section 5, Rotated Reed-Solomon codes require kr sym-
bols to re-encode. In fact, the $C_1$ drive in all the RAID-6
codes requires kr symbols to re-encode. Regardless, we
believe that presenting the performance for data and cod-
ing disks is more pertinent, since disk identities are likely
to be rotated from stripe to stripe, and therefore a disk
failure will encompass all n decoding scenarios.

Figure 11: The density of the bottom mr rows of the
Generator matrices for the codes in Figure 10.


For the RAID-6 systems, the minimum density codes 
(Blaum-Roth, Liberation and Liber8tion) as a whole ex- 
hibit excellent performance, especially when data disks 
fail. It is interesting that the Liber8tion code, whose con- 
struction was the result of a theory-less enumeration of 
matrices, exhibits the best performance. 


Faced with these results, we sought to determine if 
Generator matrix density has a direct impact on disk re- 
covery. Figure 11 plots the density of the bottom mr 
rows of the Generator matrices for each of these codes. 
To a rough degree, density is correlated to recovery 
performance of the data disks; however the correlation 
is only approximate. The precise relationship between 
codes and their recovery performance is a direction of 
further research. 


Regardless, we do draw some important conclusions
from the work. The most significant one is that reading
from the P drive or using standard Reed-Solomon codes
is not a good idea in cloud storage systems. If recovery
performance is a dominant concern, then the Liber8tion
code is the best for RAID-6, and Generalized RDP is the
best for three fault-tolerant systems.

Figure 12: The penalty of degraded reads in storage sys-
tems with k = 6.


7 Analysis of Degraded Reads 


To evaluate degraded reads, we compute the average
penalty of degraded reads for each value of B from one
to 20. The average is over all k potential data disks fail-
ing and over all kr potential starting points for the read
(all potential $d_{x,y}$). This penalty is plotted in Figure 12
as a factor of B, so that the impact of the penalty rela-
tive to the size of the read is highlighted. Since whole-
stripe reads incur no penalty, the penalty of all values
of B > kr is the same, which means that as B grows,
the penalty factor approaches one. Put another way, large
degraded reads incur very little penalty.

We plot only a few erasure codes because, with the 
exception of Rotated Reed-Solomon codes, all perform 
roughly the same. The Rotated Reed-Solomon codes, 
which were designed expressly for degraded reads, re- 
quire significantly fewer symbols on degraded reads. 
This is most pronounced when B is between 5 and 10. 
To put the results in context, suppose that symbols are 1
MB and that a cloud application is reading collections of
10 MB files such as MP3 files. If the system is in de-
graded mode, then using Rotated Reed-Solomon codes
with m = 3 incurs a penalty of 4.6%, as opposed to
19.6% using regular Reed-Solomon codes.

Combined with their good performance with whole- 
disk recovery, the Rotated Reed-Solomon codes provide 
a very good blend of properties for cloud storage sys- 
tems. Compared to regular Reed-Solomon codes, or 
to recovery strategies that employ only the P-drive for 
single-disk failures, their improvement is significant. 


8 Evaluation


We have built a storage system to evaluate the recov-
ery of sealed blocks. The goal of our experiments is to
determine the points at which the theoretical results of
sections 6 and 7 apply to storage systems configured as
cloud file system nodes.

Figure 13: The I/O performance of RAID-6 codes recov-
ering from a single disk failure averaged over all disks
(data and parity).


Experimental Setup: All experiments are run on a 12- 
disk array using a SATA interface running on a quad-core 
Intel Xeon E5620 processor with 2GB of RAM. Each 
disk is a Seagate ST3500413AS Barracuda with 500 GB 
capacity and operates at 7200 rpm. The Jerasure v1.2 
library was used for construction and manipulation of 
the Generator matrices and for Galois Field arithmetic 
in rotated Reed-Solomon coding [38]. All tests mirror 
the configurations in Table 1, evaluating a variety of era- 
sure codes for which k = 6 and m € {2,3}. Each data 
point is the average of twenty trials. Error bars denote a 
standard deviation from the mean. 


Evaluating Reconstruction: In these experiments, we
fix the symbol size at 16 MB, which results in sealed
blocks containing between 288 and 768 MB, depending
on the values of r and k. After creating a sealed block,
we measure the performance of reconstructing each of
the k + m disks when it fails. We plot the average per-
formance in Figures 13 and 14. Each erasure code con-
tains two measurements: the performance of recovering
from the P drive, and the performance of optimal recov-
ery. The data recovery rate is plotted. This is the speed of
recovering the lost symbols of data from the failed disk.

As demonstrated in Figure 13, for the RAID-6 codes,
optimal recovery improves performance by
15% to 30%, with Minimum-Density codes realizing the
largest performance gains. As the analysis predicts, the
Liber8tion code outperforms all other codes. In general,
codes with large r and less density have better perfor-
mance. Cauchy Reed-Solomon codes with r ≥ 6 com-
pare well, but with r = 3, they give up about 10% of
recovery performance. The rotated RS code performs
roughly the same as Cauchy-RS codes with r = 8.

Figure 14 confirms that Generalized-RDP substan-
tially outperforms the other codes. Cauchy Reed-
Solomon codes have different structure for m = 3 than
m = 2, with smaller r offering better performance. This
result matches the analysis in Section 6, but is surprising
nonetheless, because the smaller r codes are denser.

Figure 14: The I/O performance of m = 3 codes recov-
ering from a single disk failure.

The large block size in cloud file systems means that 
data transfer dominates recovery costs. All of the codes 
read data at about 120 MB/s on aggregate. The results in 
Figures 13 and 14 match those in Figure 10 closely. We 
explore the effect of the symbol size and, thus, the sealed 
block size in the next experiment. 


Size of Sealed Blocks: Examining the relationship be- 
tween recovery performance and the amount of the data 
underlying each symbol shows that optimal recovery 
works effectively only for relatively large sealed blocks. 
Figure 15 plots the recovery data rate as a function of 
symbol size for GenRDP and Liber8tion with and with-
out optimal recovery. We chose these codes because
their optimized version uses the fewest recovery sym-
bols at m = 2 (Liber8tion) and m = 3 (GenRDP). Our
disk array recovers data sequentially at approximately 20 
MB/s. This rate is realized for erasure codes with any 
value of r when the code is laid out on an array of disks. 
Recovery reads each disk in a sequential pass and re- 
builds the data. Unoptimized GenRDP and Liber8tion 
approach this rate with increasing symbol size. Full se- 
quential performance is realized for symbols of size 16M 
or more, corresponding to sealed blocks of size 768 MB 
for Liber8tion and 576 MB for GenRDP. 
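The sealed block sizes quoted in this section are consistent with counting only the kr data symbols of a stripe. The sketch below checks that reading of the numbers; the formula and the function name are an inference from the quoted figures, not a definition given in the text.

```python
def sealed_block_mb(k, r, symbol_mb):
    """Sealed block size, assuming it holds the k * r data symbols of a stripe."""
    return k * r * symbol_mb


K = 6
SYMBOL_MB = 16
sizes = {
    "Cauchy RS, r = 3": sealed_block_mb(K, 3, SYMBOL_MB),   # smallest r tested
    "GenRDP, r = 6": sealed_block_mb(K, 6, SYMBOL_MB),
    "Liber8tion, r = 8": sealed_block_mb(K, 8, SYMBOL_MB),  # largest r tested
}
print(sizes)
```

Under this assumption the three values are 288, 576 and 768 MB, matching the range and the per-code sizes stated for 16 MB symbols.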

We parameterize experiments by symbol size because 
recovery performance scales with the symbol size. Op- 
timal recovery determines the minimum number of sym- 
bols needed and accesses each symbol independently, in- 
curring a seek penalty for most symbols: those not adja- 
cent to other needed symbols. For small symbols, this 
recovery process is inefficient. There is some noise in 
our data for symbols of size 64K and 256K that comes
from disk track read-ahead and caching. 

Optimal recovery performance exceeds the stream-
ing recovery rate above 4M symbols, converging to the
throughput expected by analysis as disk seeks become
fully amortized. Sealed blocks using these parameters
can expect the recovery performance of distributed era-
sure codes to exceed that realized by disk arrays.

Figure 15: Data recovery rate as a function of the code-
word symbol size.

As symbols and stripes become too large, recovery re- 
quires more memory than is available and performance 
degrades. The 64 MB point for Liber8tion(Opt) with 
r = 8 shows a small decline from 16 MB, because the 
encoded stripe is 2.4 GB, larger than the 2 GB of memory
on our system. 

Figure 16: The throughput of degraded reads as a func-
tion of the number of symbols read.


Degraded Reads: Figure 16 plots the performance of
degraded reads as a function of the number of symbols
read with k = 6 and 16 MB per symbol. We com-
pare Rotated Reed-Solomon codes with P Drive recov-
ery and with the best performing optimal recovery codes,
Liber8tion for m = 2 and GenRDP for m = 3. We
measure the degraded read performance of read requests
ranging from 1 symbol to 20 symbols. For each read
size, we measure the performance of starting at each of
the potential kr starting blocks in the stripe, and plot the
average speed of the read when each data disk fails. The
results match Figure 12 extremely closely. When reading
one symbol, all algorithms perform identically, because
they all either read the symbol from a non-failed disk or
they must read six disks to reconstruct. When reading
eight symbols, Rotated Reed-Solomon coding shows the
most improvement over the others, reading 13% faster
than Liber8tion (m = 2) and Generalized RDP (m = 3).
As predicted by Figure 12, the improvement lessens as
the number of symbols read increases. The overall speed
of all algorithms improves as the number of symbols read
increases, because fewer data blocks are read for recov-
ery and then thrown away.


Figure 17: Relative cost of computation of XORs and
read I/O during recovery.


The Dominance of I/O: We put forth that erasure
codes should be evaluated based on the data used in
recovery and degraded reads. Implicit in this thesis is that
the computation for recovery is inconsequential to over- 
all performance. Figure 17 shows the relative I/O costs 
and processing time for recovery of a single disk failure. 
A single stripe with a 1 MB symbol was recovered for 
each code. Codes have different stripe sizes. Computa- 
tion cost never exceeds 10% of overall costs. Further- 
more, this computation can be overlapped with I/O when 
recovering multiple sealed blocks. 


9 Discussion 


Our findings provide guidance as to how to deploy era-
sure coding in cloud file systems with respect to
choosing a specific code and the size of sealed blocks.
Cloud file systems distribute the coded blocks from each 
stripe (sealed block) on a different set of storage nodes. 
This strategy provides load balance and incremental scal- 
ability in the data center. It also prevents correlated fail- 
ures from resulting in data loss and mitigates the effect 
that any single failure has on a data set or application 
[15]. However, it does mean that each stripe is recovered
independently from a different set of disks. To achieve 
good recovery performance when recovering indepen- 
dent stripes, codeword symbols need to be large enough 
to amortize disk seek overhead. Our results recommend
a minimum symbol size of 4 MB and a preferred size of 16 MB. This
translates to a minimum sealed block size of 144 MB
and preferred size of 576 MB for RDP and GenRDP, for
example. Cloud file systems would benefit from increas-
ing the sealed blocks to these sizes from the 64 MB de-
fault. Increasing the symbol size has drawbacks as well.
It increases memory consumption during recovery and 
increases the latency of degraded reads, because larger 
symbols need to recover more data. 
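
The sealed block sizes quoted in this section are consistent with a sealed block holding the data portion of one stripe, that is, k data disks times r symbols per disk times the symbol size. The following is a minimal sketch of that arithmetic, not part of the original evaluation. It assumes k = 6 throughout, r = 8 for Liber8tion, and r = 6 for RDP and GenRDP (the latter value is inferred from the quoted 144 MB and 576 MB figures rather than stated here); the function name is hypothetical.

```python
def sealed_block_mb(symbol_mb, k, r):
    """Data held by one stripe: k data disks x r symbols per disk x symbol size (MB)."""
    return symbol_mb * k * r

# Assumed parameters: k = 6 data disks; r = 6 (RDP/GenRDP) and r = 8 (Liber8tion).
for name, r in (("RDP/GenRDP", 6), ("Liber8tion", 8)):
    for symbol_mb in (4, 16):
        print(name, "symbol", symbol_mb, "MB ->", sealed_block_mb(symbol_mb, 6, r), "MB")
```

Under these assumptions the sketch reproduces the 144 MB and 576 MB figures for RDP and GenRDP and the 768 MB figure quoted for Liber8tion with 16 MB symbols.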


Codes differ substantially in recovery performance, 
which demands a careful selection of code and parame- 
ters for cloud file systems. Optimally-sparse, Minimum- 
Density codes tend to perform best. The Liber8tion code 
and Generalized RDP are preferred for m = 2 and 
m = 3, respectively. Reed-Solomon codes will con-
tinue to be popular for their generality. For some Reed-
Solomon codes, including rotated-RS codes, recovery
performance may be improved by more than 20%. How-
ever, the number of symbols per disk (r) has significant
impact. For k = 6 data disks, the best values are r = 7
for m = 2 and r = 4 for m = 3.


Several open problems remain with respect to optimal 
recovery and degraded reads. While our algorithm can 
determine the minimum number of symbols needed for 
recovery for any given code, it remains unknown how to 
generate recovery-optimal erasure codes. We are pursu-
ing this problem both analytically and through a program-
matic search of feasible generator matrices. Rotated RS
codes are a first result in lowering degraded read costs. 
Lower bounds for the number of symbols needed for de- 
graded reads have not been determined. 


We have restricted our treatment to MDS codes, since 
they are used almost exclusively in practice because of 
their optimal storage efficiency. However, some codes 
with decreased storage efficiency have much lower re- 
covery costs than MDS [27, 18, 28, 23, 19]. Exploring 
non-MDS codes more thoroughly will help guide those 
building cloud systems in the tradeoffs between storage 
efficiency, fault-tolerance, and performance. 


Abstract


To provide fault tolerance for cloud storage, recent stud- 
ies propose to stripe data across multiple cloud vendors. 
However, if a cloud suffers from a permanent failure and 
loses all its data, then we need to repair the lost data from 
other surviving clouds to preserve data redundancy. We 
present a proxy-based system for multiple-cloud storage 
called NCCloud, which aims to achieve cost-effective re- 
pair for a permanent single-cloud failure. NCCloud is 
built on top of network-coding-based storage schemes 
called regenerating codes. Specifically, we propose 
an implementable design for the functional minimum- 
storage regenerating code (F-MSR), which maintains the 
same data redundancy level and same storage require- 
ment as in traditional erasure codes (e.g., RAID-6), but 
uses less repair traffic. We implement a proof-of-concept 
prototype of NCCloud and deploy it atop local and com- 
mercial clouds. We validate the cost effectiveness of F- 
MSR in storage repair over RAID-6, and show that both 
schemes have comparable response time performance in 
normal cloud storage operations. 


1 Introduction 


Cloud storage provides an on-demand remote backup so- 
lution. However, using a single cloud storage vendor 
raises concerns such as having a single point of failure 
[3] and vendor lock-ins [1]. As suggested in [1, 3], a 
plausible solution is to stripe data across different cloud 
vendors. While striping data with conventional erasure 
codes performs well when some clouds experience short- 
term failures or foreseeable permanent failures [1], there 
are real-life cases showing that permanent failures do oc- 
cur and are not always foreseeable [23, 14, 20]. 

This work focuses on unexpected cloud failures. 
When a cloud fails permanently, it is important to ac- 
tivate storage repair to maintain the level of data re- 
dundancy. A repair operation reads data from existing 
surviving clouds and reconstructs the lost data in a new 
cloud. It is desirable to reduce the repair traffic, and 
hence the monetary cost, due to data migration. 

Recent studies (e.g., [6, 8, 16, 25]) propose regener- 
ating codes for distributed storage. Regenerating codes 
are built on the concept of network coding [2]. They aim 
to intelligently mix data blocks that are stored in existing
storage nodes, and then regenerate data at a new storage
node. It is shown that regenerating codes reduce the data 
repair traffic over traditional erasure codes subject to the 
same fault-tolerance level. Despite this favorable prop-
erty, regenerating codes have mainly been studied in a theo-
retical context. Their practical performance remains uncer-
tain, especially given the encoding overhead that they
incur.

In this paper, we propose NCCloud, a proxy-based 
system designed for multiple-cloud storage. We pro- 
pose the first implementable design for the functional 
minimum-storage regenerating code (F-MSR) [8], and in 
particular, we eliminate the need of performing encoding 
operations within storage nodes as in existing theoretical 
studies. Our F-MSR implementation maintains double- 
fault tolerance and has the same storage cost as in tra- 
ditional erasure coding schemes based on RAID-6, but 
uses less repair traffic when recovering a single-cloud 
failure. On the other hand, unlike most erasure coding 
schemes that are systematic (i.e., original data chunks are
kept), F-MSR is non-systematic and stores only linearly 
combined code chunks. Nevertheless, F-MSR is suited 
to rarely-read long-term archival applications [6]. 

We show that in a practical deployment setting, F- 
MSR can save the repair cost by 25% compared to 
RAID-6 for a four-cloud setting, and up to 50% as the 
number of clouds further increases. In addition, we con- 
duct extensive evaluations on both local cloud and com- 
mercial cloud settings. We show that our F-MSR imple- 
mentation only adds a small encoding overhead, which 
can be easily masked by the file transfer time over the 
Internet. Thus, our work validates the practicality of F- 
MSR via NCCloud, and motivates further studies of re- 
alizing regenerating codes in large-scale deployments. 


2 Motivation of F-MSR 


We consider a distributed, multiple-cloud storage setting
from a client’s perspective, such that we stripe data over
multiple cloud vendors. We propose a proxy-based de-
sign [1] that interconnects multiple cloud repositories, as
shown in Figure 1(a). The proxy serves as an interface
between client applications and the clouds. If a cloud
experiences a permanent failure, the proxy activates the
repair operation, as shown in Figure 1(b). That is, the
proxy reads the essential data pieces from other surviv-
ing clouds, reconstructs new data pieces, and writes these
new pieces to a new cloud. Note that this repair operation
does not involve direct interactions among the clouds.


Figure 1: Proxy-based design for multiple-cloud stor-
age: (a) normal operation, and (b) repair operation when
Cloud node 1 fails. During repair, the proxy regenerates
data for the new cloud.

We consider fault-tolerant storage based on maximum 
distance separable (MDS) codes. Given a file object, we 
divide it into equal-size native chunks, which in a non- 
coded system, would be stored on k clouds. With cod- 
ing, the native chunks are encoded by linear combina- 
tions to form code chunks. The native and code chunks 
are distributed over n > k clouds. When an MDS code is 
used, the original file object may be reconstructed from 
the chunks contained in any k of the n clouds. Thus, it 
tolerates the failure of any n — k clouds. We call this 
feature the MDS property. The extra feature of F-MSR 
is that reconstructing a single native or code chunk may 
be achieved by reading up to 50% less data from the sur- 
viving clouds than reconstructing the whole file. 

This paper considers a multiple-cloud setting that is
double-fault tolerant (e.g., RAID-6) and provides data
availability under at most two cloud failures (e.g., a few
days of outages [7]). That is, we set k = n − 2. We
expect that such a fault tolerance level suffices in prac- 
tice. Given that a permanent failure is less frequent but 
possible, our primary objective is to minimize the cost of 
storage repair for a permanent single-cloud failure, due 
to the migration of data over the clouds. 

We define the repair traffic as the amount of outbound 
data being read from other surviving clouds during the 
single-cloud failure recovery. Our goal is to minimize 
the repair traffic for cost-effective repair. Here, we do not 
consider the inbound traffic (i.e., the data being written 
to a cloud), as it is free of charge in many cloud vendors 
(see Table 1 in Section 5). 

We now show how F-MSR saves the repair traffic via 
an example. Suppose that we store a file of size M on 
four clouds, each viewed as a logical storage node. Let 
us first consider RAID-6, which is double-fault tolerant. 
Here, we consider the RAID-6 implementation based on 
Reed-Solomon codes [26], as shown in Figure 2(a). We
divide the file into two native chunks (i.e., A and B) of
size M/2 each. We add two code chunks formed by the
linear combinations of the native chunks. Suppose now
that Node 1 is down. Then the proxy must download
the same number of chunks as the original file from two 
other nodes (e.g., B and A + B from Nodes 2 and 3, re- 
spectively). It then reconstructs and stores the lost chunk 
A on the new node. The total storage size is 2M, while 
the repair traffic is M. 

We now consider the double-fault tolerant implemen-
tation of F-MSR in a proxy-based setting, as shown
in Figure 2(b). F-MSR divides the file into four na-
tive chunks, and constructs eight distinct code chunks
$P_1, \ldots, P_8$ formed by different linear combinations of
the native chunks. Each code chunk has the same size
M/4 as a native chunk. Any two nodes can be used to re-
cover the original four native chunks. Suppose Node 1 is
down. The proxy collects one code chunk from each sur-
viving node, so it downloads three code chunks of size
M/4 each. Then the proxy regenerates two code chunks
$P_1'$ and $P_2'$ formed by different linear combinations of
the three code chunks. Note that $P_1'$ and $P_2'$ are still lin-
ear combinations of the native chunks. The proxy then
writes $P_1'$ and $P_2'$ to the new node. In F-MSR, the stor-
age size is 2M (as in RAID-6), but the repair traffic is
0.75M, a saving of 25%.

To generalize F-MSR for n storage nodes, we divide a
file of size M into 2(n − 2) native chunks, and generate
2n code chunks. Then each node will store two
code chunks of size $\frac{M}{2(n-2)}$ each. Thus, the total storage
size is $\frac{Mn}{n-2}$. To repair a failed node, we download one
chunk from each of n − 1 nodes, so the repair traffic is
$\frac{M(n-1)}{2(n-2)}$. In contrast, for RAID-6, the total storage size is
also $\frac{Mn}{n-2}$, while the repair traffic is M. When n is large,
F-MSR can save the repair traffic by close to 50%.

Note that F-MSR keeps only code chunks rather than
native chunks. To access a single chunk of a file, we need
to download and decode the entire file for the particular
chunk. Nevertheless, F-MSR is acceptable for long-term
archival applications, whose read frequency is typically
low [6]. Also, to restore backups, it is natural to retrieve
the entire file rather than a particular chunk.

This paper considers the baseline RAID-6 implemen-
tation using Reed-Solomon codes. Its repair method is to
reconstruct the whole file, and is applicable to all era-
sure codes in general. Recent studies [18, 28, 29] show
that data reads can be minimized specifically for XOR-
based erasure codes. For example, in RAID-6, data reads
can be reduced by 25% compared to reconstructing the
whole file [28, 29]. Although such approaches are sub-
optimal (recall that F-MSR can save up to 50% of repair
traffic in RAID-6), their use of efficient XOR operations
can be of practical interest.
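
The storage and repair-traffic figures for general n given earlier in this section can be checked numerically. The sketch below is illustrative only and uses hypothetical function names. It works in units of the file size M and assumes k = n − 2, with each F-MSR node storing two code chunks of size M/(2(n − 2)) and a repair downloading one chunk from each of the n − 1 surviving nodes.

```python
from fractions import Fraction

def fmsr_costs(n):
    """Return (total storage, repair traffic) for F-MSR, in units of the file size M."""
    chunk = Fraction(1, 2 * (n - 2))   # size of one code chunk
    storage = 2 * n * chunk            # two code chunks on each of the n nodes
    repair = (n - 1) * chunk           # one chunk from each surviving node
    return storage, repair

def raid6_costs(n):
    """Return (total storage, repair traffic) for RAID-6, in units of M."""
    storage = Fraction(n, n - 2)       # n chunks of size M/(n-2)
    repair = Fraction(1)               # k chunks are downloaded, i.e., the whole file
    return storage, repair

def repair_saving(n):
    """Fraction of RAID-6 repair traffic that F-MSR avoids."""
    return 1 - fmsr_costs(n)[1] / raid6_costs(n)[1]

for n in (4, 8, 12, 100):
    print(n, fmsr_costs(n), float(repair_saving(n)))
```

For n = 4 this gives a storage size of 2M and repair traffic of 0.75M, matching the worked example, and the saving approaches 50% as n grows.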


USENIX Association 




3 F-MSR Implementation 


In this section, we present a systematic approach for im-
plementing F-MSR. We specify three operations for F-
MSR on a particular file object: (1) file upload; (2) file 
download; (3) repair. A key difference of our imple- 
mentation from prior theoretical studies is that we do 
not require storage nodes to have encoding capabilities, 
so our implementation can be compatible with today’s 
cloud storage. Another key design issue is that instead of 
simply generating random linear combinations for code 
chunks (as assumed in [8]), we also guarantee that the 
generated linear combinations always satisfy the MDS 
property to ensure data availability, even after iterative 
repairs. Here, we implement F-MSR as an MDS code 
for general (n,k). We assume that each cloud repository 
corresponds to a logical storage node. 


3.1 File Upload 


To upload a file F, we first divide it into k(n − k) equal-
size native chunks, denoted by $(F_i)_{i=1,2,\ldots,k(n-k)}$. We
then encode these k(n − k) native chunks into n(n − k)
code chunks, denoted by $(P_i)_{i=1,2,\ldots,n(n-k)}$. Each $P_i$
is formed by a linear combination of the k(n − k) na-
tive chunks. Specifically, we let $EM = [\alpha_{i,j}]$ be an
n(n − k) × k(n − k) encoding matrix for some coefficients
$\alpha_{i,j}$ (where $i = 1, \ldots, n(n-k)$ and $j = 1, \ldots, k(n-k)$)
in the Galois field $GF(2^8)$. We call a row vector of
EM an encoding coefficient vector (ECV), which con-
tains k(n − k) elements. We let $ECV_i$ denote the $i$-th
row vector of EM. We compute each $P_i$ by the scalar
product of $ECV_i$ and the native chunk vector $(F_j)$, i.e.,
$P_i = \sum_{j=1}^{k(n-k)} \alpha_{i,j} F_j$ for $i = 1, 2, \ldots, n(n-k)$, where
all arithmetic operations are performed over $GF(2^8)$. The
code chunks are then evenly stored in the n storage
nodes, each having (n − k) chunks. Also, we store the
whole EM in a metadata object that is then replicated to
all storage nodes (see Section 4). There are many ways
of constructing EM, as long as it satisfies the MDS prop-
erty and the repair MDS property (see Section 3.3). Note
that the implementation details of the arithmetic opera-
tions in Galois Fields are extensively discussed in [15].
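
The upload encoding can be illustrated with a short sketch. It is not the NCCloud implementation (whose coding routines are written in C); it is a minimal Python illustration in which every code chunk is a byte-wise linear combination of the native chunks over GF(2^8). The reduction polynomial 0x11D and all function names are assumptions made for this example, and the random matrix generated here is not checked for the MDS or repair MDS properties.

```python
import random

def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly is an assumed reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= poly            # reduce the product back to 8 bits
        b >>= 1
    return result

def encode(native_chunks, em):
    """Return code chunks P_i = sum_j em[i][j] * F_j, computed byte by byte."""
    size = len(native_chunks[0])
    code_chunks = []
    for ecv in em:                               # one ECV (row of EM) per code chunk
        out = bytearray(size)
        for coeff, chunk in zip(ecv, native_chunks):
            for pos in range(size):
                out[pos] ^= gf_mul(coeff, chunk[pos])
        code_chunks.append(bytes(out))
    return code_chunks

def random_em(n, k, rng):
    """Random n(n-k) x k(n-k) matrix; NOT guaranteed to satisfy the MDS property."""
    rows, cols = n * (n - k), k * (n - k)
    return [[rng.randrange(256) for _ in range(cols)] for _ in range(rows)]

# n = 4, k = 2: four native chunks become eight code chunks, two per node.
native = [bytes([i] * 4) for i in range(1, 5)]
em = random_em(4, 2, random.Random(0))
code = encode(native, em)
nodes = [code[i:i + 2] for i in range(0, len(code), 2)]
```

Grouping consecutive rows of EM by node, as done in the last line, is a layout choice of this sketch; the text only requires that each node receives n − k chunks.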


Figure 2: Examples of repair operations in (a) RAID-6 and (b) F-MSR with n = 4 and k = 2.


3.2 File Download 


To download a file, we first download the correspond- 
ing metadata object that contains the ECVs. Then we 
select any k of the n storage nodes, and download the 
k(n − k) code chunks from the k nodes. The ECVs of
the k(n − k) code chunks can form a k(n − k) × k(n − k)
square matrix. If the MDS property is maintained, then 
by definition, the inverse of the square matrix must ex- 
ist. Thus, we multiply the inverse of the square matrix 
with the code chunks and obtain the original k(n — k) 
native chunks. The idea is that we treat F-MSR as a stan- 
dard Reed-Solomon code, and our technique of creating 
an inverse matrix to decode the original data has been 
described in the tutorial [22]. 
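
The download procedure amounts to inverting the square matrix of ECVs and applying the inverse to the downloaded code chunks. The following is a minimal sketch of that idea, using Gauss-Jordan elimination over GF(2^8); it is not the optimized routine used by NCCloud. The reduction polynomial 0x11D and all function names are assumptions of this example.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly is an assumed reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse in GF(2^8), using a^254 = a^(-1) for a != 0."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def gf_invert(matrix):
    """Invert a square matrix over GF(2^8) by Gauss-Jordan elimination."""
    size = len(matrix)
    aug = [list(row) + [int(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = next((r for r in range(col, size) if aug[r][col]), None)
        if pivot is None:
            raise ValueError("matrix is singular, so the MDS property is violated")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, v) for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [v ^ gf_mul(factor, p) for v, p in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def combine(matrix, chunks):
    """Apply a coefficient matrix to a list of equal-size chunks, byte by byte."""
    size = len(chunks[0])
    out = []
    for row in matrix:
        buf = bytearray(size)
        for coeff, chunk in zip(row, chunks):
            for pos in range(size):
                buf[pos] ^= gf_mul(coeff, chunk[pos])
        out.append(bytes(buf))
    return out

def decode(code_chunks, ecvs):
    """Recover the native chunks from k(n-k) code chunks and their ECVs."""
    return combine(gf_invert(ecvs), code_chunks)

native = [b"ab", b"cd"]
ecvs = [[1, 1], [1, 2]]            # toy invertible matrix, for illustration only
recovered = decode(combine(ecvs, native), ecvs)
```

A singular matrix raises an error in this sketch, which corresponds to a set of k nodes from which the file cannot be decoded.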


3.3 Iterative Repairs


We now consider the repair of F-MSR for a file F for a
permanent single-node failure. Given that F-MSR regen-
erates different chunks in each repair, one challenge is to
ensure that the MDS property still holds even after itera-
tive repairs. This is in contrast to regenerating the exact
lost chunks as in RAID-6, which guarantees the invari-
ance of the stored chunks. Here, we propose a two-phase
checking heuristic as follows. Suppose that the $(r-1)^{th}$
repair is successful, and we now consider how to operate
the $r^{th}$ repair for a single permanent node failure (where
r > 1). We first check if the new set of chunks in all stor-
age nodes satisfies the MDS property after the $r^{th}$ repair.
In addition, we also check if another new set of chunks in
all storage nodes still satisfies the MDS property after the
$(r+1)^{th}$ repair, should another single permanent node
failure occur (we call this the repair MDS property). We
now describe the $r^{th}$ repair as follows.


Step 1: Download the encoding matrix from a surviving 
node. Recall that the encoding matrix EM specifies the 
ECVs for constructing all code chunks via linear combi- 
nations of native chunks. We use these ECVs for our later 
two-phase checking heuristic. Since we embed EM in a
metadata object that is replicated, we can simply down- 
load the metadata object from one of the surviving nodes. 


Step 2: Select one random ECV from each of the n − 1
surviving nodes. Note that each ECV in EM corre-
sponds to a code chunk. We randomly pick one ECV
from each of the n − 1 surviving nodes. We denote these
ECVs by $ECV_{i_1}, ECV_{i_2}, \ldots, ECV_{i_{n-1}}$.


Step 3: Generate a repair matrix. We construct an (n −
k) × (n − 1) repair matrix $RM = [\gamma_{i,j}]$, where each ele-
ment $\gamma_{i,j}$ (where $i = 1, \ldots, n-k$ and $j = 1, \ldots, n-1$) is
randomly selected in $GF(2^8)$. Note that the idea of gen-
erating a random matrix for reliable storage is consistent
with that in [24].


Step 4: Compute the ECVs for the new code chunks and
reproduce a new encoding matrix. We multiply RM
with the ECVs selected in Step 2 to construct n − k
new ECVs, denoted by $ECV'_i = \sum_{j=1}^{n-1} \gamma_{i,j} ECV_{i_j}$ for
$i = 1, 2, \ldots, n-k$. Then we reproduce a new encoding
matrix, denoted by EM', which is given by:

$$\text{$i$-th row vector of } EM' = \begin{cases} ECV_i, & \text{if the row belongs to a surviving node,} \\ ECV'_i, & \text{if the row belongs to the new node.} \end{cases}$$
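
Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions rather than the NCCloud code: the reduction polynomial 0x11D, the function names, and the convention that rows of EM are grouped by node (node 0 first, n − k rows per node) are all choices made for this example.

```python
import random

def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly is an assumed reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def random_repair_matrix(n, k, rng):
    """Step 3: an (n-k) x (n-1) matrix of random GF(2^8) elements."""
    return [[rng.randrange(256) for _ in range(n - 1)] for _ in range(n - k)]

def new_ecvs(rm, selected_ecvs):
    """Step 4: each new ECV is sum_j rm[i][j] * selected_ecvs[j] over GF(2^8)."""
    width = len(selected_ecvs[0])
    result = []
    for row in rm:
        ecv = [0] * width
        for gamma, old in zip(row, selected_ecvs):
            for pos in range(width):
                ecv[pos] ^= gf_mul(gamma, old[pos])
        result.append(ecv)
    return result

def rebuild_em(em, new_node, chunks_per_node, fresh):
    """Replace the rows of em held by new_node (0-based) with the fresh ECVs.

    Assumes rows are grouped by node, chunks_per_node consecutive rows each.
    """
    start = new_node * chunks_per_node
    return em[:start] + fresh + em[start + chunks_per_node:]

# n = 4, k = 2: one ECV from each of the three surviving nodes, two new ECVs.
rng = random.Random(1)
selected = [[rng.randrange(256) for _ in range(4)] for _ in range(3)]
rm = random_repair_matrix(4, 2, rng)
fresh = new_ecvs(rm, selected)
```

Because the new chunk data in Step 6 is produced by applying the same RM to the downloaded chunks, the fresh ECVs describe the regenerated chunks as linear combinations of the native chunks.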


Step 5: Given EM', check if both the MDS and repair
MDS properties are satisfied. Intuitively, we verify the
MDS property by enumerating all $\binom{n}{k}$ subsets of k nodes
to see if each of their corresponding encoding matrices
has full rank. For the repair MDS property, we check
that for any failed node (out of n nodes), we can collect
any one out of n − k chunks from the other n − 1 surviving
nodes and reconstruct the chunks in the new node, such
that the MDS property is maintained. The number of
checks performed for the repair MDS property is at most
$n(n-k)^{n-1}\binom{n}{k}$. If n is small, then the enumeration
complexities for both MDS and repair MDS properties
are manageable. If either phase fails, then we return
to Step 2 and repeat. We emphasize that Steps 1 to 5 only
deal with the ECVs, so their overhead does not depend
on the chunk size.
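
The first phase of the check in Step 5 (the MDS property) can be sketched as an enumeration over all subsets of k nodes, computing the rank of the stacked ECVs over GF(2^8). The sketch below covers only this first phase, not the repair MDS property. The reduction polynomial 0x11D, the function names, and the convention that rows of the encoding matrix are grouped by node are assumptions of this example.

```python
from itertools import combinations

def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly is an assumed reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse in GF(2^8), using a^254 = a^(-1) for a != 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def gf_rank(matrix):
    """Rank of a matrix over GF(2^8), by row reduction."""
    rows = [list(r) for r in matrix]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = gf_inv(rows[rank][col])
        rows[rank] = [gf_mul(inv, v) for v in rows[rank]]
        for r in range(rank + 1, len(rows)):
            if rows[r][col]:
                factor = rows[r][col]
                rows[r] = [v ^ gf_mul(factor, p) for v, p in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def has_mds_property(em, n, k):
    """True if the ECVs of every subset of k nodes form a full-rank matrix."""
    per_node = n - k                       # rows of em are assumed grouped by node
    for nodes in combinations(range(n), k):
        sub = [em[node * per_node + c] for node in nodes for c in range(per_node)]
        if gf_rank(sub) != k * per_node:
            return False
    return True

# Toy example with n = 3, k = 2 (one chunk per node, two native chunks).
good = [[1, 0], [0, 1], [1, 1]]
bad = [[1, 0], [0, 1], [1, 0]]             # nodes 0 and 2 hold the same ECV
```

Since the check touches only the ECVs, its cost is independent of the chunk size, as the text notes.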


Step 6: Download the actual chunk data and regenerate 
new chunk data. If the two-phase checking in Step 5 
succeeds, then we proceed to download the n − 1 chunks
that correspond to the selected ECVs in Step 2 from the 
n — 1 surviving storage nodes to NCCloud. Also, using 
the new ECVs computed in Step 4, we regenerate new 
chunks and upload them from NCCloud to a new node. 


Remark: We claim that in addition to checking the MDS 
property, checking the repair MDS property is essential 
for iterative repairs. We conduct simulations to justify 
that checking the repair MDS property can make itera- 
tive repairs sustainable. In our simulations, we consider 
multiple rounds of permanent node failures for different 
values of n. Specifically, in each round, we randomly 
pick a node to permanently fail and trigger a repair. We 
say a repair is bad if the loop of Steps 2 to 5 is repeated 
over 10 times. We observe that without checking the re-
pair MDS property, we see a bad repair very quickly,
say after no more than 7 and 2 rounds for n = 8 and
n = 12, respectively. On the other hand, checking the re-
pair MDS property makes iterative repairs sustainable for
hundreds of rounds for different values of n, and we have
not yet found any bad repair after extensive simulations.


4 NCCloud Design and Implementation 


We implement NCCloud as a proxy that bridges user 
applications and multiple clouds. Its design is built on 
three layers. The file system layer presents NCCloud as a 
mounted drive, which can thus be easily interfaced with 
general user applications. The coding layer deals with 
the encoding and decoding functions. The storage layer 
deals with read/write requests with different clouds. 

Each file is associated with a metadata object, which is 
replicated at each repository. The metadata object holds 
the file details and the coding information (e.g., encoding 
coefficients for F-MSR). 

NCCloud is mainly implemented in Python, while the 
storage schemes are implemented in C for better effi- 
ciency. The file system layer is built on FUSE [12]. 
The coding layer implements both RAID-6 and F-MSR. 
RAID-6 is built on zfec [30], and our F-MSR implemen- 
tation mimics the optimizations made in zfec for a fair 
comparison. 

Recall that F-MSR generates multiple chunks to be 
stored on the same repository. To save the request cost 
overhead (see Table 1), multiple chunks destined for the 
same repository are aggregated before upload. Thus, 
F-MSR keeps only one (aggregated) chunk per file ob- 
ject on each cloud, as in RAID-6. To retrieve a specific 
chunk, we calculate its offset within the combined chunk 
and issue a range GET request. 
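
The offset calculation for retrieving one chunk from an aggregated object can be sketched as follows. This is an illustration only: it assumes that the chunks on a repository are equal in size and concatenated in index order, and the function name is hypothetical. The only external convention used is the standard HTTP Range header, whose byte ranges are inclusive at both ends.

```python
def chunk_range_header(chunk_index, chunk_size):
    """Build an HTTP Range header selecting one chunk of an aggregated object.

    Assumes equal-size chunks stored back to back, with 0-based chunk_index.
    """
    start = chunk_index * chunk_size
    end = start + chunk_size - 1           # HTTP byte ranges include the last byte
    return {"Range": f"bytes={start}-{end}"}

# Second of two 1 KB chunks aggregated on one repository.
header = chunk_range_header(1, 1024)
```

The resulting header would be attached to the GET request for the aggregated object, so that only the selected chunk is transferred.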


5 Evaluation 


We now use our NCCloud prototype to evaluate RAID-6 
(based on Reed-Solomon codes) and F-MSR in multiple- 
cloud storage. In particular, we focus on the setting n = 
4 and k = 2. We expect that using n = 4 clouds may 
suffice for practical deployment. Based on this setting, 
we allow data retrieval with at most two cloud failures. 

The goal of our experiments is to explore the practi- 
cality of using F-MSR in multiple-cloud storage. Our 
evaluation consists of two parts. We first compare the 
monetary costs of using RAID-6 and F-MSR based on 
the price plans of today’s cloud vendors. We also em- 
pirically evaluate the response time performance of our 
NCCloud prototype atop a local cloud and also a com- 
mercial cloud vendor. 


5.1 Cost Analysis 


Table 1 shows the monthly price plans for three major
vendors as of September 2011. For Amazon S3, we take
the cost from the first chargeable usage tier (i.e., storage
usage within 1TB/month; data transferred out more than
1GB/month but less than 10TB/month).


Table 1: Monthly price plans (in US dollars) for Amazon
S3 (US Standard), Rackspace Cloud Files and Windows
Azure Storage, as of September, 2011. Rows: storage (per GB);
data transfer in (per GB); data transfer out (per GB);
PUT, POST (per 10K requests); GET (per 10K requests).


From the analysis in Section 2, we can save 25% of 
the download traffic during storage repair when n = 4. 
The storage size and the number of chunks being gen- 
erated per file object are the same in both RAID-6 and 
F-MSR (assuming that we aggregate chunks in F-MSR 
as described in Section 4). However, in the analysis, 
we have ignored two practical considerations: the size 
of metadata (Section 4) and the number of requests is- 
sued during repair. We now argue that they are negligi- 
ble and that the simplified calculations based only on file 
size suffice for real-life applications. 


Metadata size: Our implementation currently keeps the 
F-MSR metadata size within 160B, regardless of the file 
size. NCCloud aims at long-term backups (see Sec- 
tion 2), and can be integrated with other backup applica- 
tions. Existing backup applications (e.g., [27, 11]) typi- 
cally aggregate small files into a larger data chunk in or- 
der to save the processing overhead. For example, the de- 
fault setting for Cumulus [27] creates chunks of around 
4MB each. The metadata size is thus usually negligible. 


Number of requests: From Table 1, we see that some 
cloud vendors nowadays charge for requests. RAID-6 
and F-MSR differ in the number of requests when re- 
trieving data during repair. Suppose that we store a file 
of size 4MB with n = 4 and k = 2. During repair, 
RAID-6 and F-MSR retrieve two and three chunks, re- 
spectively (see Figure 2). The cost overhead due to the 
GET request for RAID-6 is at most 0.427%, and that for 
F-MSR is at most 0.854%, a mere 0.427% increase. 


5.2 Response Time Analysis 


We deploy our NCCloud prototype in real environments. 
We then evaluate the response time performance of dif- 
ferent operations in two scenarios. The first part ana- 
lyzes in detail the time taken by different NCCloud op- 
erations, and is done on a local cloud storage in order 
to lessen the effects of network fluctuations. The second 
part evaluates how NCCloud actually performs in com- 
mercial clouds. All results are averaged over 40 runs. 

Figure 3: Response times of main NCCloud operations.

Figure 4: Breakdown of response time when dealing with
a 500MB file.


5.2.1 On a Local Cloud


The experiment on a local cloud is carried out on an 
object-based storage platform based on OpenStack Swift 
1.4.2 [21]. NCCloud is mounted on a machine with Intel 
Xeon E5620 and 16GB RAM. This machine is connected 
to an OpenStack Swift platform attached with a number 
of storage servers, each with Intel Core i5-2400 and 8GB
RAM. We create (n+1) = 5 containers on Swift, so each 
container resembles a cloud repository (one of them is a 
spare node used in repair). 

In this experiment, we test the response time of three 
basic operations of NCCloud: (a) file upload; (b) file 
download; (c) repair. We use eight randomly generated 
files from 1MB to 500MB as the data set. We set the path
of a chosen repository to a non-existing location to simu- 
late a node failure in repair. Note that there are two types 
of repair for RAID-6, depending on whether the failed 
node contains a native chunk or a code chunk. 


FAST '12: 10th USENIX Conference on File and Storage Technologies


Figure 5: Response times of NCCloud on Azure: (a) file upload; (b) file
download; (c) repair. The horizontal axis of each panel is the file size (MB).


Figure 3 shows the response times of all three op- 
erations (with 95% confidence intervals plotted), and 
Figure 4 shows five key constituents of the response 
time when dealing with a 500MB file. Figure 3 shows 
that RAID-6 has less response time in file upload and 
download. With the help of Figure 4, we pinpoint the 
overhead of F-MSR over RAID-6. Due to having the 
same MDS property, RAID-6 and F-MSR exhibit similar 
data transfer time during upload/download. However, F- 
MSR displays a noticeable encoding/decoding overhead 
over RAID-6. When uploading a 500MB file, RAID-6 
takes 1.490s to encode while F-MSR takes 5.365s; when 
downloading a 500MB file, no decoding is needed in the 
case of RAID-6 as the native chunks are available, but 
F-MSR takes 2.594s to decode. 

On the other hand, F-MSR has slightly less response 
time in repair. The main advantage of F-MSR is that it 
needs to download less data during repair. In repairing 
a 500MB file, F-MSR spends 3.887s in download, while
the native-chunk repair of RAID-6 spends 4.832s. 

Although RAID-6 generally has less response time 
than F-MSR in a local cloud environment, we expect that 
the encoding/decoding overhead of F-MSR can be easily 
masked by network fluctuations over the Internet, as will 
be shown next. 


5.2.2 On a Commercial Cloud 


The following experiment is carried out on a machine 
with Intel Xeon E5530 and 16GB RAM running 64-bit 
Ubuntu 9.10. We repeat the three operations in Sec- 
tion 5.2.1 on four randomly generated files from 1MB to 
10MB atop Windows Azure Storage. On Azure, we create 
(n+1) = 5 containers to mimic different cloud repositories. 
The same operations for both RAID-6 and F-MSR 
are run interleaved to lessen the effect of network 
fluctuation on the comparison due to different times of the day. 
Figure 5 shows the results for different file sizes with 
95% confidence intervals plotted. Note that although we 
have used only Azure in this experiment, actual usage of 
NCCloud should stripe data over different vendors and 
locations for better availability guarantees. 

From Figure 5, we do not see distinct response time 
differences between RAID-6 and F-MSR in all opera- 
tions. Furthermore, on the same machine, F-MSR takes 
around 0.150s to encode and 0.064s to decode a 10MB 
file (not shown in the figures). These constitute roughly 
3% of the total upload and download times (4.962s and 
2.240s respectively). Given that the 95% confidence in- 
tervals for the upload and download times are 0.550s 
and 0.438s respectively, network fluctuation plays a big- 
ger role in determining the response time. Overall, we 
demonstrate that F-MSR does not have significant per- 
formance overhead over our baseline RAID-6 implemen- 
tation. 
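
The roughly 3% figure can be rechecked directly from the numbers quoted above. The short sketch below does only that arithmetic; the variable names are ours and are not part of NCCloud.

```python
# Timings quoted in the text for a 10MB file on Azure, in seconds.
encode_s, upload_s = 0.150, 4.962
decode_s, download_s = 0.064, 2.240
# 95% confidence intervals quoted for the upload and download times.
upload_ci_s, download_ci_s = 0.550, 0.438

encode_share = encode_s / upload_s      # fraction of upload time spent encoding
decode_share = decode_s / download_s    # fraction of download time spent decoding

print(f"encoding: {encode_share:.1%} of upload time")
print(f"decoding: {decode_share:.1%} of download time")
# The coding cost is smaller than the measurement spread in both cases.
print(encode_s < upload_ci_s, decode_s < download_ci_s)
```

Both shares come out close to 3%, and each coding time is below the corresponding confidence interval, which is the basis for the claim that network fluctuation dominates.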


6 Related Work 


There are several systems proposed for multiple-cloud 
storage. HAIL [5] provides integrity and availabil- 
ity guarantees for stored data. DEPSKY [4] addresses 
Byzantine Fault Tolerance by combining encryption and 
erasure coding for stored data. RACS [1] uses erasure 
coding to mitigate vendor lock-ins when switching cloud 
vendors. It retrieves data from the cloud that is about 
to fail and moves the data to the new cloud. Unlike 
RACS, NCCloud excludes the failed cloud in repair. 
All the above systems are based on erasure codes, while 
NCCloud considers regenerating codes with an emphasis 
on storage repair. 

Regenerating codes (see survey [9]) exploit the opti- 
mal trade-off between storage cost and repair traffic. Ex- 
isting studies mainly focus on theoretical analysis. Sev- 
eral studies (e.g., [10, 13, 19]) empirically evaluate ran- 
dom linear codes for peer-to-peer storage. However, 
their evaluations are mainly based on simulations. NCFS 
[17] implements regenerating codes, but does not con- 
sider MSR codes that are based on linear combinations. 
Here, we consider the F-MSR implementation, and per- 
form empirical experiments in multiple-cloud storage. 


7 Conclusions 


We present NCCloud, a multiple-cloud storage file sys- 
tem that practically addresses the reliability of today’s 
cloud storage. NCCloud not only achieves fault tolerance 
of storage, but also allows cost-effective repair when a 
cloud permanently fails. NCCloud implements a practical 
version of the functional minimum storage regenerating 
code (F-MSR), which regenerates new chunks during 
repair subject to the required degree of data redundancy. 
Our NCCloud prototype shows the effectiveness of F-MSR 
in accessing data, in terms of monetary costs and 
response times. The source code of NCCloud is available 
at http://ansrlab.cse.cuhk.edu.hk/software/nccloud. 


Abstract 


I/O traces are good sources of information about real- 
world workloads; replaying such traces is often used to 
reproduce the most realistic system behavior possible. 
But traces tend to be large, hard to use and share, and 
inflexible in representing more than the exact system 
conditions at the point the traces were captured. Often, 
however, researchers are not interested in the precise de- 
tails stored in a bulky trace, but rather in some statisti- 
cal properties found in the traces—properties that affect 
their system’s behavior under load. 

We designed and built a system that (1) extracts many 
desired properties from a large block I/O trace, (2) builds 
a statistical model of the trace’s salient characteristics, 
(3) converts the model into a concise description in the 
language of one or more synthetic load generators, and 
(4) can accurately replay the models in these load gener- 
ators. Our system is modular and extensible. We exper- 
imented with several traces of varying types and sizes. 
Our concise models are 4-6% of the original trace size, 
and our modeling and replay accuracy are over 90%. 


1 Introduction 


Traces are a time-honored way to collect information 
about real-world workloads. The information contained 
in traces allows a workload to be characterized using fac- 
tors such as the exact size and offset of each I/O request, 
read/write ratio, ordering of requests, etc. By replaying 
a trace, users can evaluate real-world system behavior, 
optimize a system based on that behavior, and compare 
the performance of different systems [21, 23, 25, 34]. 

Despite the benefits of traces, they are hard to use in 
practice. A trace collected on one system cannot easily 
be scaled to match the characteristics of another. It is dif- 
ficult to modify traces systematically, e.g., by changing 
one workload parameter but leaving all others constant. 
Traces are hard to describe and compare in terms that are 
easily understood by system implementors. Large trace 
files are time-consuming to distribute and can affect the 
system’s behavior during replay by polluting the page 
cache or causing an I/O bottleneck [20]. 

In reviewing related work, we observed that in many 
cases replaying the exact trace is not required. Instead, 
it is often sufficient to use a synthetic workload gener- 
ator that accurately reproduces certain specific proper- 
ties. For example, a particular system might be more 
sensitive to the read-write ratio than to operation size. 
In this situation one does not really need to replay the 
trace precisely; a synthetic workload that emulates that 


read-write ratio would suffice. Of course, this example 
is simplistic, and in many cases one would be interested 
in more complex combinations of the workload parame- 
ters. However, the general idea that only some properties 
of the trace affect system behavior remains valid. 

Because many systems respond only to a few pa- 
rameters, researchers have developed many benchmarks 
and synthetic workload generators, such as IOzone [7], 
Filebench [12], and Iometer [33], which avoid many 
of the deficiencies of traces. But it can be difficult to 
configure a benchmark so that it produces a realistic 
workload; simple ones are not sufficiently flexible, while 
powerful ones like Filebench offer so many options that 
it can be daunting to select the correct settings. 

In this work we propose to fill the gap between traces 
and benchmarks by converting traces into the languages 
of the benchmarks. We focus here on block traces due to 
their relative simplicity, but we plan to extend this work 
to other trace types, e.g., file system and NFS. 

Our system creates a universal representation of the 
trace, expressed as a multi-dimensional matrix in which 
each dimension represents the statistical distribution of 
a trace parameter or a function. Each parameter is cho- 
sen to represent a specific workload property. We imple- 
mented the most commonly used properties, such as I/O 
size, inter-arrival time, seek distance, read-write ratio, 
etc. End users can easily add new ones as desired. For 
each benchmark, a small plugin converts the universal 
trace matrix into the specific benchmark’s language. 

Many workloads vary significantly during the tracing 
period. To address this issue, our system supports trace 
chunking across time. Within each chunk, the workload 
is considered to be stable and uniform and is expressed 
as a separate matrix. We use chunk deduplication to save 
space in periods where the workload is the same. 

We evaluated the accuracy of our system by generat- 
ing models from several publicly available traces. We 
first replayed each trace on a test system, observing 
throughput, latency, I/O queue length and utilization, 
power consumption, request sizes, CPU and memory us- 
age, and the numbers of interrupts and context switches. 
Then we emulated the trace by running benchmarks with 
generated parameters on the same system, collected the 
same observations, and compared the results. 

Our error was less than 10% on average, and 15% at 
most; it can be controlled by varying several parameters. 
For a basic set of metrics, we converted a 1.4GB trace to 
the Filebench language in only 30s. The resulting trace 
description was 60MB, or 23.3× smaller. 


2 Background and Motivation 


Statistics Matter. Trace replay is a common evalua- 
tion technique because, unlike any other testing method, 
by definition traces represent reality. However, this real- 
ism comes at a price: the trace represents one instance of 
one system at one point in time. The next day’s workload 
will inevitably be different, as will the same workload on 
a system with different hardware, competing workloads, 
etc. In the worst case, these variations might cause a sys- 
tem to be unintentionally optimized for an atypical oper- 
ating point. Even if a trace accurately represents a target 
workload, rapid changes in hardware performance make 
it difficult to evaluate a design on a modern machine us- 
ing measurements and traces captured on a different sys- 
tem only a few years earlier. 

Our key observation is that for many purposes, statis- 
tics are what matter. The exact ordering of operations, 
their precise timing, the blocks or files accessed, and 
many other details recorded in a trace are variable and 
would change if it were re-recorded. Thus, when we re- 
play a trace, we do not necessarily want to reproduce 
every detail as precisely as possible; instead, we would 
like to accurately represent its statistical properties. 

An advantage of thinking of traces statistically is that 
they become much more flexible. For example, a trace 
collected a decade ago would record accesses to only 
a fraction of the blocks on a modern disk, and at a very 
different rate. Compared to a bulky trace, a statistical de- 
scription is much simpler to scale to a modern machine 
and therefore provides a convenient abstraction for per- 
forming systematic evaluation of many systems. 

Generating a good description requires representative 
trace properties to be selected. In general, the most ap- 
propriate properties depend on the system being tested, 
so it is impossible to create a complete list. For most 
purposes, however, the parameters of interest are well 
defined and widely adopted, e.g., I/O rate and distribu- 
tion, read/write ratio. Thus, a statistical model of a trace 
should be able to capture those parameters, and should 
be able to describe them in sufficient detail so that no 
important information is lost. In particular, we should 
not reduce complex, empirically observed distributions 
to overly simple mathematical models, such as Poisson 
arrival processes, without justification. 

Some workloads may also exhibit nonstandard, or 
even undiscovered, properties that might alter system 
behavior. It is therefore advisable to preserve the orig- 
inal traces to ensure these properties are retained. A 
workload generator can be adapted to include such char- 
acteristics once they are identified. 


System Response. To evaluate a system empirically, 
workloads are applied and appropriate metrics measure 
its response. Performance is often characterized by 
throughput, latency, CPU utilization, I/O queue length, 
and memory usage [39, 45]. Power consumption 
characterizes energy efficiency [29, 36]. 

In many papers, these metrics are summarized by 
statistics such as averages or distributions. But as we 
argue above, it is often possible to accurately evaluate 
these metrics without resorting to a full and detailed 
trace replay. If the system response to a trace emula- 
tion is similar to that of a full replay, then emulation can 
replace full replay without biasing the results. 

To evaluate the accuracy of our trace extraction and 
modeling system, we surveyed papers in Usenix FAST 
conferences from 2008-2011 and noted that the fre- 
quently used metrics fell into four categories: (1) 
throughput and latency; (2) I/O utilization and average 
I/O queue length; (3) CPU utilization and memory us- 
age; and (4) power consumption. Most of the surveyed 
papers included 1-2 of these metrics, but in our study we 
evaluate all four types to ensure a comprehensive com- 
parison. We claim that if all response metrics are similar, 
then the trace is modeled properly. We feel that our set 
of metrics is sufficiently representative and comprehen- 
sive to produce reliable results. There is still a chance 
that an unmeasured response parameter may differ; but 
our system is modular and easily extensible to emulate 
any additional metrics one desires. 


Replay Methods. We use system response to evaluate 
our trace emulation accuracy. However, a system’s re- 
sponse depends on the replay method, and varies based 
on the goal of the study. To study peak performance, 
traces are often accelerated [31, 40, 44,48]. For power 
efficiency, traces are usually replayed verbatim to pre- 
serve realistic idle periods [5,9]. To stress specific sub- 
systems, a subset of the trace is sometimes replayed [38]. 
Our workload models can emulate existing trace-replay 
methods as well as more sophisticated ones. 


3 Design 
Our five design goals, in decreasing priority, are: 


1. Accuracy: Ensure that trace replay and trace emu- 
lation yield matching evaluation results. 

2. Flexibility: First, leverage existing powerful work- 
load generators, rather than creating new ones. 
Therefore, traces should be translated into models 
that can be accurately described using the capabili- 
ties of existing benchmarks. Second, allow users to 
choose anything from accurate yet bulky models to 
smaller but less precise ones. 

3. Extensibility: Allow the model to include addi- 
tional properties chosen by the user. 

4. Conciseness: The resulting model should be much 
smaller than the original trace. 

5. Speed: The time to translate large traces should be 
reasonable even on a modest machine. 


USENIX Association 


Feature Extraction. The first step in our model- 
building process is to extract important features from 
the trace. We first discuss how we extract parameters 
from workloads whose statistical characteristics do not 
change over time, i.e., stationary workloads. Then we 
describe how to emulate a non-stationary workload. 

Each block trace record has a set of fields to describe 
the parameters of a given request. Fields may include the 
operation type, offset or block number, I/O size, times- 
tamp, etc. Our translator is field-oblivious: it considers 
every parameter as a number. We designate these 
parameters as an n-dimensional vector $p = (p_1, p_2, \ldots, p_n)$. 
We define a feature function vector on $p$: 


$$f = (f_1(p, s_1), f_2(p, s_2), \ldots, f_m(p, s_m))$$ 


Each feature function represents an analysis of some 
property of the trace; $s_i$ represents private state data for 
the $i$-th feature function, which lets us define features 
across multiple trace entries and parameters. 

For example, assume that $p_1$ and $p_2$ represent the I/O 
size and offset fields, respectively. We can then define 
the simple feature functions $f_1$ (just the I/O size itself) 
and $f_2$ (the logarithmic inter-arrival distance, i.e., the 
offset difference between two consecutive requests): 


$$f_1 = f_1(p, s_1) = p_1$$ 


$$f_2 = f_2(p, s_2) = \log(p_2 - s_2.\mathit{prev\_offset})$$ 
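

The two feature functions above can be sketched in a few lines. This is only an illustration, not the translator's actual code: the record layout (a dict with `size` and `offset` keys), the function names, and the `evaluate` helper are all hypothetical. The text does not say which logarithm base is used or how zero or negative offset differences are handled; the sketch assumes base 2, takes the absolute difference, and maps a zero distance to 0.0.

```python
import math

def io_size(record, state):
    """f1: the I/O size itself; needs no private state."""
    return record["size"]

def log_offset_distance(record, state):
    """f2: log of the offset difference between consecutive requests.

    `state` is the private state of this feature function; it remembers
    the offset of the previous request across calls.
    """
    prev = state.get("prev_offset")
    state["prev_offset"] = record["offset"]
    if prev is None:
        return None                      # first record has no predecessor
    distance = abs(record["offset"] - prev)   # assumption: sign is ignored
    return math.log2(distance) if distance > 0 else 0.0

def evaluate(features, record, states):
    """Map one trace record to a point in the feature space."""
    return tuple(f(record, s) for f, s in zip(features, states))

features = [io_size, log_offset_distance]
states = [{} for _ in features]          # one private state per feature
trace = [
    {"size": 4096, "offset": 0},
    {"size": 8192, "offset": 1024},
    {"size": 512, "offset": 1024 + 4096},
]
points = [evaluate(features, r, states) for r in trace]
print(points)
```

Keeping the state outside the function mirrors the role of the per-feature private state in the definition: a feature may depend on earlier records without the translator knowing what any field means.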


In our translator, the user first chooses a set of m fea- 
ture functions. Evaluating these functions on a single 
trace record results in a vector that represents a point in 
an m-dimensional feature space. The translator divides 
the feature space into buckets of user-specified size, and 
collects a histogram of feature occurrences in a multi- 
dimensional matrix—the feature matrix—that explicitly 
captures the relevant statistics of the workload, and im- 
plicitly records their correlations. 

For example, using the two feature functions above, 
plus a third that encodes the operation (0 for reads, 1 for 
writes), the resulting feature matrix might look like the 
one in Figure 1. In this case, the trace held 52 requests 
of size less than 4KB and inter-arrival distance less than 
1KB; of those, 38 were reads and 14 were writes. 
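
A minimal sketch of the bucketing step follows, assuming a sparse representation in which the feature matrix is a `Counter` keyed by bucket coordinates. The function names, the bucket widths, and the sample points are hypothetical; for simplicity the distance dimension here uses the raw byte distance rather than its logarithm. Only the counts 38, 14, and 52 come from the example in the text.

```python
from collections import Counter

def bucket(value, width):
    """Index of the fixed-width bucket that contains `value`."""
    return int(value // width)

def build_feature_matrix(points, widths):
    """Histogram of feature points; keys are tuples of bucket indices."""
    matrix = Counter()
    for point in points:
        cell = tuple(bucket(v, w) for v, w in zip(point, widths))
        matrix[cell] += 1
    return matrix

# Dimensions: (I/O size in bytes, inter-arrival distance in bytes, operation)
# with operation 0 for reads and 1 for writes.
points = (
    [(2048, 512, 0)] * 38      # small, nearby reads
    + [(1024, 256, 1)] * 14    # small, nearby writes
    + [(8192, 4096, 0)] * 5    # made-up larger reads elsewhere in the space
)
widths = (4096, 1024, 1)       # 4KB size buckets, 1KB distance buckets
matrix = build_feature_matrix(points, widths)

small_near = sum(c for cell, c in matrix.items() if cell[0] == 0 and cell[1] == 0)
print(matrix[(0, 0, 0)], matrix[(0, 0, 1)], small_near)
```

Because each request increments exactly one cell, correlations between dimensions are retained implicitly: the cell `(0, 0, 1)` records how often a small, nearby request was also a write.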

By choosing a set of feature functions, users can adjust 
the workload representation to capture any important 
trace features. By selecting an appropriate bucket 
granularity, users can control the accuracy of the 
representation, trading off precision for computational 
complexity in the translator and matrix size. Stage 1 in 
Figure 2 shows the translator’s role in the overall design. 

Once the feature matrix has been created, the transla- 
tor can perform a number of additional operations on it: 
projection, summation along dimensions, computation 
of conditional probabilities, and normalization. These 


operations can be used by the benchmark plugins (de- 
scribed below) to calculate parameters. For example, 
using the matrix in Figure 1, a plugin might first sum 
across the distance-vs.-size plane to calculate the total 
numbers of reads and writes, normalize these to find 
P(read), and then generate benchmark code to condition 
I/O size on the operation type. 
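

The matrix operations named above (summation along dimensions, normalization, conditional probabilities) can be sketched as follows. This assumes the same sparse `Counter` representation keyed by bucket tuples; the helper names and the sample counts other than 38 and 14 are hypothetical.

```python
from collections import Counter

def sum_over(matrix, keep):
    """Sum out every dimension not listed in `keep` (a projection)."""
    out = Counter()
    for cell, count in matrix.items():
        out[tuple(cell[d] for d in keep)] += count
    return out

def normalize(matrix):
    """Turn counts into relative frequencies."""
    total = sum(matrix.values())
    return {cell: count / total for cell, count in matrix.items()}

def conditional(matrix, target, given):
    """P(target bucket | given bucket) for every observed pair."""
    joint = sum_over(matrix, (target, given))
    marginal = sum_over(matrix, (given,))
    return {(t, g): c / marginal[(g,)] for (t, g), c in joint.items()}

# Dimensions: (size bucket, distance bucket, operation); 0 = read, 1 = write.
matrix = Counter({(0, 0, 0): 38, (0, 0, 1): 14, (1, 0, 0): 10, (1, 1, 1): 18})

ops = sum_over(matrix, (2,))             # total reads and writes
p_read = normalize(ops)[(0,)]
size_given_op = conditional(matrix, target=0, given=2)
print(p_read, size_given_op[(0, 0)], size_given_op[(1, 1)])
```

A plugin would use `p_read` to set the read/write mix and `size_given_op` to choose an I/O size distribution separately for reads and for writes.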


Clearly, the choice of feature functions affects the 
quality of the emulation; currently the investigator must 
do this based on the insight into the particular system of 
interest, e.g., whether it has been optimized for certain 
workloads that can be reflected in an appropriate fea- 
ture function. We have implemented a library of over 
a dozen standard feature functions based on those com- 
monly found in the literature [10, 11, 26, 30], including 
operation type, I/O size, offset distribution, inter-arrival 
distance, inter-arrival time, process identifier, etc. New 
feature functions can easily be added as needed to cap- 
ture specialized system characteristics. 


Benchmark Plugins. Once a feature matrix has been 
constructed from a trace, it is possible to use it directly as 
input to a workload generator. However, our goal in this 
research is not to create yet another generator. Instead, 
we believe that it is best to build on the work of others 
by using existing workload generators and benchmarks. 
This approach allows us to easily reuse all the exten- 
sive facilities that these benchmarks provide. Many ex- 
isting benchmarks offer a way to configure the workload 
that they generate; some offer command-line configuration 
parameters (e.g., IOzone [7] and Iometer [33]) while 
others offer a more extensive language for that purpose 
(e.g., Filebench [12] and fio [13]). 

Figure 1: Workload representation using a feature matrix 

Figure 2: Overall System Design 

Most existing benchmarks use statistical models to 
generate a workload. Some of them use average parameter 
values; others use more complex distributions. In all 
cases, our feature matrices contain all the information 
needed to control the models used by these benchmarks. 
A simple plugin translates the feature matrix into a spe- 
cific benchmark’s parameters or language. For some 
benchmarks, the expressiveness of the parameters might 
limit the achievable accuracy, but even then the plugin 
will help choose the best settings to emulate the original 
trace’s workload. Stage 3 in Figure 2 demonstrates the 
role of the benchmark plugins in the overall design. 


For our initial investigations, we have implemented 
plugins for Filebench and IOzone. We chose Filebench 
for its flexibility, and IOzone because it is more suitable 
for micro-benchmarking. We found that it was easy to 
add a plugin for a new benchmark, since only a single 
function has to be registered with the translator. The 
size of the function depends on the number of feature 
functions and the complexity of the target benchmark. 
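
One way to picture a plugin that registers a single function is sketched below. The registry, the decorator, the plugin name, and the output format are all hypothetical; in particular, the emitted string is not Filebench or IOzone syntax, and the translator's real registration interface is not described in the text.

```python
PLUGINS = {}   # hypothetical registry: benchmark name -> conversion function

def register_plugin(name):
    """Decorator that registers one conversion function per benchmark."""
    def decorator(func):
        PLUGINS[name] = func
        return func
    return decorator

@register_plugin("toy-benchmark")
def toy_plugin(matrix):
    """Convert a feature matrix into a made-up one-line configuration.

    The last dimension of each cell is assumed to be the operation type
    (0 for reads, 1 for writes).
    """
    reads = sum(c for cell, c in matrix.items() if cell[-1] == 0)
    total = sum(matrix.values())
    return f"read_fraction={reads / total:.2f}"

matrix = {(0, 0, 0): 38, (0, 0, 1): 14}
config = PLUGINS["toy-benchmark"](matrix)
print(config)
```

A real plugin would grow with the number of feature functions it understands and with the expressiveness of the target benchmark's configuration language.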


Chunking. Many real-world traces are non-stationary: 
their statistical characteristics vary over time. This is es- 
pecially true for traces that cover several hours, days, 
or weeks. However, most workload generators apply a 
stationary load, and cannot vary it over time. We ad- 
dress this issue with trace chunking: splitting a trace 
into chunks by time, such that the statistics of any given 
chunk are relatively stable. Finding chunk boundaries is 
difficult, so we first use a constant user-defined chunk 
size, measured in seconds. For each chunk, we compute 
a feature matrix independently; this results in a sequence 
of matrices. We then convert these fixed chunks into 
variable-sized ones by feeding the matrices to a dedupli- 
cator that merges adjacent similar matrices (Stage 2 in 
Figure 2). This optimization works well because many 
traces remain stable for extended periods before shifting 
to a different workload mode. We normalize the matri- 
ces before comparing them, so that the absolute number 
of requests in a chunk does not affect the comparison. 
We use the maximum distance between matrix cells as a 
metric of similarity. When two matrices are found to be 
similar, we average their values and use the result to rep- 
resent the workloads in the corresponding time chunks. 


Besides detecting varying workload phases, the deduplication 
process also reduces the model size. To achieve 
even further compression, we support all-ways deduplication: 
every chunk in a trace is deduplicated against 
every other chunk (not just adjacent ones). 

Along with the matrices, we generate a time-to- 
matrices map that serves as an additional input to the 
benchmark plugins. If the target benchmark is unable 
to support a multi-phase workload, the plugin generates 
multiple invocations with appropriate parameters. 

In the example in Figure 2, we set the trace duration 
to 60s and the initial chunk size to 10s, so the transla- 
tor generated six matrices. After all-ways deduplication, 
only two remained. 
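
The 60s example can be reproduced with a small sketch of all-ways deduplication that also emits a time-to-matrices map. The function names, the threshold, and the six sample chunks are hypothetical, and the actual phase pattern in Figure 2 is not known from the text. For simplicity this sketch keeps the first matrix of each group as its representative instead of averaging.

```python
def normalize(matrix):
    total = sum(matrix.values())
    return {c: v / total for c, v in matrix.items()} if total else {}

def max_distance(a, b):
    return max((abs(a.get(c, 0.0) - b.get(c, 0.0)) for c in set(a) | set(b)),
               default=0.0)

def all_ways_dedup(chunks, chunk_seconds, threshold):
    """Compare every chunk against all representatives found so far.

    Returns (unique matrices, time map), where each time-map entry is
    (start second, end second, index into the unique list).
    """
    unique, time_map = [], []
    for i, raw in enumerate(chunks):
        current = normalize(raw)
        for idx, rep in enumerate(unique):
            if max_distance(rep, current) <= threshold:
                break                     # reuse an existing matrix
        else:
            unique.append(current)        # no match: new workload mode
            idx = len(unique) - 1
        time_map.append((i * chunk_seconds, (i + 1) * chunk_seconds, idx))
    return unique, time_map

mode_a = {(0,): 90, (1,): 10}             # read-heavy phase
mode_b = {(0,): 20, (1,): 80}             # write-heavy phase
chunks = [mode_a, mode_a, mode_b, mode_a, mode_b, mode_b]   # 6 x 10s = 60s
unique, time_map = all_ways_dedup(chunks, chunk_seconds=10, threshold=0.05)
print(len(unique), time_map)
```

Six fixed-size chunks reduce to two stored matrices, and the time map tells a plugin which matrix applies in each interval, including the non-adjacent repeat of the first mode.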


4 Implementation 


Traces from different sources often have different for- 
mats. We wanted our translator to be efficient and 
portable. We chose the efficient and flexible DataSeries 
format [2]—recommended by the Storage Networking 
Industry Association (SNIA)—and we selected SNIA’s 
draft block-trace semantics [37]. We wrote converters 
to allow experimentation with existing traces in other 
formats. We also created a block-trace replayer for 
DataSeries, which supports several commonly used re- 
play modes. In total we wrote about 3,700 LoC: 1,500 
in the translator, 800 in the converters, 1,000 in the 
DataSeries replayer, and 400 in the Filebench and IO- 
zone plugins. We plan to release these publicly. 


5 Evaluation 


To evaluate the accuracy, conversion speed, and compression 
of our system, we used multiple micro-benchmarks 
and a variety of real traces. In this paper 
we present evaluation results based on two traces: Finance1 
[28] and MS-WBS [22]. The Finance1 trace 
captures the activity of several OLTP applications run- 
ning at two large financial institutions. The MS-WBS 
traces were collected from daily builds of the Microsoft 
Windows Server operating system. The high-level char- 
acteristics of the traces are presented in Table 1. 

It is fair to assume that the accuracy of our transla- 
tor might depend on the system under evaluation. In 
our experiments we used a spectrum of block devices: 


Characteristic        Finance1   MS-WBS 
Duration              [...]      [...] 
Reads/Writes (10^6)   1.2/4.1    0.7/0.6 
Avg I/O size          3.5KB      20KB 
Seq. Requests         [...]      [...] 


Table 1: High-level characteristics of the used traces 

Figure 3: Reads and writes per second, Setup P, Fin1 trace. 

Figure 5: Memory and CPU usage, Setup V, Fin1 trace. 


various disk drives, flash drives, RAIDs, and even vir- 
tual block devices. In this paper we present results from 
two extremes of the spectrum. In the first experimental 
setup—Setup P—we used a Physical machine with an 
external SCSI Seagate Cheetah 300GB disk drive con- 
nected through an Adaptec 39320 controller. The fact 
that the drive was powered externally allowed us to mea- 
sure its power consumption using a WattsUp meter [43]. 

The second experimental setup (Setup V) is an 
enterprise-class system that has a Virtual machine run- 
ning under the VMware ESX 4.1 Hypervisor. The 
VM accesses its virtual disks on an NFS server backed 
by a GPFS parallel file system [19, 35]. The VM 
runs CentOS 6.0; the ESX and GPFS servers are IBM 
System x3650’s, with GPFS using a DS4700 storage 
controller. Accuracy metrics were recorded at the 
NFS/GPFS server. 

On both setups, we first replayed traces and then emu- 
lated them using Filebench. In all experiments we set the 
chunk size to 20s and enabled all feature functions. We 
chose the matrix granularity for each dimension exper- 
imentally, by gradually decreasing it until the accuracy 


began to drop. During all runs we collected the accuracy 
parameters specified in Section 2 using the iostat, vmstat, 
and wattsup tools; we plotted graphs showing the 
value of each accuracy parameter versus time for both 
replay and emulation. Due to limited space, we only 
present the graphs for a few representative accuracy pa- 
rameters. However, we give the average and maximum 
emulation error for all experiments. 

Figure 3 depicts how the throughput, for both reads 
and writes, changes with time for the Finance1 trace. 
The replay was performed with infinite acceleration; it 
took about 5 hours to complete on Setup P. The trace 
emulation line closely follows the replay line; the Root 
Mean Square (RMS) distance is lower than 6% and the 
maximum distance is below 15%. In the beginning of 
the run, read throughput was 4 times higher than later 
in the trace. By inspecting the model we found that 
the workload exhibits high sequentiality in the begin- 
ning of the trace. After startup, the read throughput falls 
to 50-100 ops/s, which is reasonable for an OLTP-like 
workload and our hardware. The write performance is 
2-2.5 times higher than for reads, due to the controller’s 
write-back cache that makes writes more sequential. 
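
The text does not spell out how the RMS and maximum distances are computed. One plausible reading, used in the sketch below, is the relative difference between the emulation and replay values at each sample time; this definition, the function name, and the sample series are assumptions.

```python
import math

def relative_distances(replay, emulation):
    """RMS and maximum of per-sample relative differences.

    Samples where the replay value is zero are skipped, since the
    relative difference is undefined there.
    """
    rel = [abs(e - r) / r for r, e in zip(replay, emulation) if r != 0]
    rms = math.sqrt(sum(d * d for d in rel) / len(rel))
    return rms, max(rel)

# Made-up throughput samples (ops/s) taken at the same points in time.
replay = [100.0, 200.0, 400.0]
emulation = [105.0, 190.0, 400.0]
rms, worst = relative_distances(replay, emulation)
print(f"RMS distance {rms:.1%}, maximum distance {worst:.1%}")
```

Reporting both values is useful because the RMS distance summarizes typical deviation while the maximum exposes short-lived mismatches that an average would hide.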

Figure 4 depicts disk-drive power consumption in 
Setup P during a 10-minute non-accelerated replay and 
emulation of the MS-WBS trace. In the first 5 min- 
utes trace activity was low, resulting in low power usage. 
Later, a burst of random disk requests increased power 
consumption by almost 40%. The emulation line devi- 
ates from the replay line by an average of 6%. 

In Setup V, the GPFS server was caching requests 
coming from a virtual machine. As a result, the run time 
of the Fin1 trace was only 75 minutes. The memory and 
CPU consumption of the GPFS server during this time 
are shown in Figure 5. Memory usage rises steadily, 
increasing by about 500MB by the end of the run, which is 
the working-set size of the Fin1 trace. Discrepancies 
between replay and emulation are within 10%, but there are 
visible deviations at times when the memory usage steps 
up. We attribute this to the complexity of GPFS’s 
cache policy, which is affected by a workload parameter 
that we did not emulate. CPU utilization remained 
steady at about 10% for both replay and emulation. 

Figure 6 summarizes the errors for all parameters, for 
both setups and traces. The maximum emulation error 
was below 15% and RMS distance was 10% on average. 
Although the maximum discrepancy might seem high, 
Figure 3 shows sufficient behavioral accuracy. 

The selection of feature matrix dimensions is vital for 
achieving high accuracy. If a system is sensitive to a 
workload property that is missing in the feature matrix, 
accuracy can suffer. For example, disk- and SSD-based 
storage systems may have radically different queuing 
and prefetching policies. To ensure high-fidelity replays 
across both types of systems, the feature matrix should 
capture the impact of appropriate parameters. 

Figure 6: Root Mean Square (RMS) and maximum relative distances of accuracy parameters for two traces and two systems. 

The chunk size and matrix granularity also affect the 
model’s accuracy. Our general strategy is to select these 
parameters liberally at first (e.g., 100s chunk size and 
1MB granularity for I/O size) and then gradually and 
repeatedly restrict them (e.g., 10s chunk size, 1KB I/O 
size) as needed until the desired accuracy is achieved. 
High enough accuracy can always be reached by making 
the chunk size and bucket granularity sufficiently small. 
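
The coarse-to-fine strategy can be sketched as a simple loop. Everything here is hypothetical: `evaluate_error` stands in for a full translate, emulate, and compare cycle, the halving schedule and floor values are arbitrary choices, and `toy_error` is a made-up error model used only to make the sketch runnable.

```python
def refine(evaluate_error, chunk_s=100.0, io_bucket_bytes=1 << 20,
           target_error=0.10, min_chunk_s=1.0, min_bucket_bytes=512):
    """Tighten chunk size and I/O size granularity until the error is acceptable.

    Starts from liberal settings (100s chunks, 1MB buckets) and halves both
    on every round; stops early if the floor values are reached.
    """
    while True:
        error = evaluate_error(chunk_s, io_bucket_bytes)
        if error <= target_error:
            return chunk_s, io_bucket_bytes, error
        if chunk_s <= min_chunk_s and io_bucket_bytes <= min_bucket_bytes:
            return chunk_s, io_bucket_bytes, error   # cannot refine further
        chunk_s = max(min_chunk_s, chunk_s / 2)
        io_bucket_bytes = max(min_bucket_bytes, io_bucket_bytes // 2)

def toy_error(chunk_s, io_bucket_bytes):
    """Made-up model: coarser settings give proportionally larger error."""
    return chunk_s / 1000 + io_bucket_bytes / (1 << 20) / 10

chunk_s, bucket_bytes, error = refine(toy_error, target_error=0.08)
print(chunk_s, bucket_bytes, error)
```

Starting coarse keeps the model small and the conversion fast; the loop pays for finer granularity only when the measured error requires it.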


Conversion Speed and Model Size. The speed of 
conversion and the size of the resulting model depend 
on the trace length and the translator parameters. On our 
2.5GHz server, traces were converted at about 50MB/s, 
which is close to the throughput of the 7200RPM disk 
drive. The resulting model without deduplication was 
approximately 10-15% of the size of the original trace. 
Deduplication removed over 60% of the chunks in both the 
Fin1 and MS-WBS traces, resulting in a final model size 
reduction of 94-96%. All sizes were measured after 
compressing both traces and models using gzip. 
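
The size figures above are mutually consistent, as a short calculation shows. The sketch assumes that model size scales linearly with the number of chunks kept and uses 60% as the fraction of chunks removed, which the text gives only as a lower bound.

```python
# Model size before deduplication, as a fraction of the original trace.
undeduplicated = (0.10, 0.15)
chunks_removed = 0.60      # "over 60%" in the text; 0.60 is the lower bound

# Assumption: size is proportional to the number of chunks that remain.
final = [f * (1 - chunks_removed) for f in undeduplicated]
reduction = [1 - f for f in final]

print([f"{f:.0%}" for f in final])        # final model size vs. trace
print([f"{r:.0%}" for r in reduction])    # overall size reduction
```

Under these assumptions the final model is about 4% to 6% of the trace, which matches the stated reduction of 94-96%.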


6 Related Work 


The body of research related to traces is large; we cite 
only a representative sample. Many studies have fo- 
cused on accurate trace collection with minimum inter- 
ference [1, 4, 24, 31, 32]. Other researchers have pro- 
posed trace-replaying frameworks at different layers in 
the storage stack [3,20,48,48,49]. Since a trace contains 
information about the workload applied to the system, a 
number of works focused on trace-driven workload char- 
acterization [22, 23, 25,34]. N. Yadwadkar proposed to 
identify an application based on its trace [46]. 

After a workload is characterized, a few researchers 
have suggested a workload model that allows them to 
generate synthetic workloads with identical characteris- 
tics [6, 14-18, 41,42,47]. These works address only one 
or two workload properties, whereas we present a general 
framework for any number of properties. Also, we 
chunk data and generate workload expressions for the 
languages of already existing benchmarks. 


The two projects most closely related to ours are Dis- 
tiller [27] and Chen’s Workload Analyzer [8]. Dis- 
tiller’s main goal is to identify important workload prop- 
erties. We can use this information to intelligently de- 
fine dimensions for our feature matrix. Chen uses ma- 
chine learning techniques to identify the dependencies 
between workload features. However, the authors do not 
emulate traces based on the extracted information. 


7 Conclusions and Future Work 


We have created a system that extracts flexible workload 
models from large I/O traces. Through the novel use of 
chunking, we support traces with time-varying statistical 
properties. In addition, trace extraction is tunable, allow- 
ing model accuracy and size to be traded off against cre- 
ation time. Existing I/O benchmarks can readily use the 
generated model by implementing a plugin. Our eval- 
uation with Filebench and several block traces demon- 
strated that the accuracy of generated models approaches 
95%, while the model size is less than 6% of the original 
trace size. Such concise models allow easy comparison, 
scaling and other modifications. 


In the future we plan to support file-system-level 
traces, build multi-layer models, and add flexibility in 
the analysis phase. Our current chunking method is sim- 
ple and we want to investigate alternative chunking tech- 
niques. We will also work on a graphical tool for manual 
trace chunking. To avoid manual selection of the transla- 
tor’s parameters, we want to explore various artificial in- 
telligence approaches. To further reduce the model size, 
we plan to improve the compression ratio by matching 
empirical distributions in the feature matrix to explicit 
mathematical functions. We recognize that our list of ac- 
curacy metrics is not complete and want to experiment 
with other accuracy parameters (e.g., latency distribu- 
tions). We also plan to develop tools and techniques that 
will simplify various operations on our models, such as 
time and size scaling, and comparison to other models. 


Abstract 


Storage for cluster applications is typically provisioned 
based on rough, qualitative characterizations of applica- 
tions. Moreover, configurations are often selected based 
on rules of thumb and are usually homogeneous across a 
deployment; to handle increased load, the application is 
simply scaled out across additional machines and storage 
of the same type. As deployments grow larger and stor- 
age options (e.g., disks, SSDs, DRAM) diversify, how- 
ever, current practices are becoming increasingly ineffi- 
cient in trading off cost versus performance. 

To enable more cost-effective deployment of cluster 
applications, we develop scc—a storage configuration 
compiler for cluster applications. scc automates clus- 
ter configuration decisions based on formal specifica- 
tions of application behavior and hardware properties. 
We study a range of storage configurations and iden- 
tify specifications that succinctly capture the trade-offs 
offered by different types of hardware, as well as the 
varying demands of application components. We ap- 
ply scc to three representative applications and find that 
scc is expressive enough to meet application Service 
Level Agreements (SLAs) while delivering 2x-4.5x sav- 
ings in cost on average compared to simple scale-out 
options. scc’s advantage stems mainly from its ability 
to configure heterogeneous—rather than conventional, 
homogeneous—cluster architectures to optimize cost. 


1 Introduction 


Today, application providers can choose from a range of 
storage choices to provision the infrastructure for cluster- 
based applications. Storage technologies as diverse as 
DRAM, solid state drives (SSDs), and hard disks present 
complex trade-offs in cost, capacity, performance (along 
multiple dimensions), and power consumption. New 
storage technologies such as phase change memory [14] 
will soon further complicate the space. 

Provisioning, however, is based largely on rules of 
thumb and best practices. Applications are broadly cat- 
egorized as storage, compute, or memory intensive and 
are typically deployed on homogeneous clusters heavy 
on the corresponding resource. As application load in- 
creases, deployments are “scaled out” by simply adding 
more storage and compute in the same configuration. 
Not only does this state of affairs fail to take full ad- 
vantage of the diversity of available storage choices, but 
the increasing scale of deployments makes such ineffi- 
ciencies worse; inefficiencies multiplied over thousands 
of servers can have substantial costs. In the scale-out 
model, a poor initial choice can greatly inflate expenses. 

In this paper, we pursue an alternate approach—the 
automated selection of cluster storage configurations 
based on formal specifications of applications, hardware, 
and workloads. Initially, such an approach places signif- 
icant burden on those developing and deploying applica- 
tions to characterize applications and workloads. How- 
ever, the resultant savings in cost necessary to satisfy Ser- 
vice Level Agreements (SLAs) can be substantial. 

Our primary contributions in implementing this ap- 
proach are two-fold. First, we determine how the charac- 
teristics of applications, workloads, and hardware should 
be specified in order to automate the selection of cluster 
configurations. To do so, we study several representative 
deployment scenarios and identify a parsimonious yet 
sufficiently expressive set of parameters that capture the 
trade-offs offered by different types of storage devices 
and the varying demands across application components. 
Though others have pursued a similar approach of for- 
mally specifying workloads and hardware [5, 7, 34], we 
extend this approach to account for various types of stor- 
age media (e.g., disk, SSD, and DRAM) and to jointly 
capture storage and compute requirements of applica- 
tions. We show that it is feasible to concisely summarize 
the most salient parameters that determine the resource 
requirements of specific application deployments, thus 
minimizing the burden of formal specification. 

Second, we develop scc, a storage configuration com- 
piler that takes specifications of applications, workloads, 
and hardware as input, automates the navigation of the 
large space of storage configurations, and zeroes in on 
the configuration that meets application SLAs at mini- 
mum cost. To evaluate scc, we experiment with three 
distributed applications with distinctly different work- 
load characteristics: 1) ProductSearch, a product search 
webservice modeled on Google Merchant Center [17], 2) 
Terasort, a MapReduce job to sort large tuple collections, 
and 3) PhotoShare, a photo-sharing Web service modeled 
on Flickr. By deploying these applications on a range of 
cluster configurations and measuring application perfor- 
mance on these configurations, we present empirical ev- 
idence that scc is expressive enough to capture the needs 
of a range of applications. 


Table 1: Example set of cluster building blocks input to scc. 
Cost is price plus energy costs for 3 years. scc takes read and 
write gap parameters as input rather than IOPS. 


In developing scc and applying it to diverse applica- 
tion workloads, we make three key observations. First, 
the right choice of storage configuration depends not 
only on the storage capacity and I/O needs of the ap- 
plication, but also on the application’s compute require- 
ments and on the types of server configurations available. 
When an application performs a set of operations in se- 
quence, the resources assigned to serve each of these op- 
erations must be jointly optimized to satisfy the perfor- 
mance bound on the sequence of operations at minimum 
cost. For example, in an application that performs an 
I/O operation on some data followed by some compu- 
tation, the storage type assigned to the data depends on 
the amount of computation. When the computation con- 
sumes significant time, the data may need to be stored 
on fast storage like SSDs to meet performance bounds, 
whereas when compute time is low, there is greater slack 
in performing the I/O and hence, slower cheaper storage 
like disk may suffice. 


Figure 1: Overview of scc. 


Second, we find that clusters with heterogeneity— 
rather than conventional homogeneity—across servers 
are necessary to optimize cost. The resources required 
differ across application components because of varying 
ratios of capacity, compute, and I/O throughput needs 
across components. For example, in a deployment of the 
photo-sharing Web service, it may be cheaper to store 
photos on disk and cache thumbnails in DRAM; stor- 
ing both on disk or both in DRAM may result in higher 
cost due to higher I/O throughput needs from thumbnails 
or higher storage capacity needs of photos, respectively. 
As a result, scc’s suggested configuration meets perfor- 
mance SLAs at low cost. For example, in experiments 
with Terasort, we find that scc meets performance re- 
quirements at 15-20% lower cost than a homogeneous 
configuration recommended based on best practices. 

Finally, we also find that the most cost-effective clus- 
ter architecture depends not only on the application be- 
ing provisioned but also on the workload and perfor- 
mance requirements. Data that was initially capacity- 
bound may become I/O-bound at higher loads, calling 
for shifts from high capacity but slow storage, e.g., disks, 
to low capacity but fast storage, e.g., SSDs. As a result, 
cluster configurations output by scc for ProductSearch 
and PhotoShare result in 2x-4.5x average savings in cost 
compared to similarly performant scale-out options. 


2 Problem setting and overview 


Identifying an appropriate cluster architecture to host a 
large-scale service is often not straightforward. For ex- 
ample, given a set of resources to choose from (e.g., as 
shown in Table 1), an application provider has to answer 
several questions. What storage technologies should be 
employed, and how should data be partitioned across 
them? Where should caching be employed? What types 
of servers should be chosen to house the selected storage 
units? In addition, even if the application’s implemen- 
tation is efficient and there is coarse-grained parallelism 
in the underlying workload, how will algorithmic shifts 
in the application or variations in workload affect the ap- 
propriate cluster architecture? Our goal is to automate 
the process of answering these questions, rather than re- 
lying solely on human judgment. 

Problem setting. In developing scc, our focus is 
on the typical scenario where a cluster is dedicated to 
a specific application, rather than large-scale data cen- 
ters (e.g., Google, Microsoft) that host a mix of applica- 
tions. scc caters to the common case where an applica- 
tion provider either acquires hardware or uses third-party 
infrastructure to deploy an application. In such cases, the 
question we seek to answer is: what information from 
the infrastructure provider and from the application de- 
veloper is necessary to determine a cost-effective cluster 
configuration that meets performance goals? 

Overview of scc. As shown in Figure 1, scc takes 
three inputs: i) a model of application behavior, speci- 
fied by the application’s developer, ii) characteristics of 
available hardware building blocks specified by the in- 
frastructure provider, and iii) application performance 
metrics, i.e., a parameterized service level agreement 
(SLA). Given these inputs, scc computes how cluster 
cost varies as a function of SLA and outputs a low-cost 
cluster configuration that meets the SLA at each point 
in the space. For example, a webservice SLA might 
specify a peak query rate per second. For each poten- 
tial SLA value (e.g., 1000 queries per second), scc de- 
termines a cost-effective cluster architecture capable of 
satisfying the SLA. scc’s output cost vs. SLA value dis- 
tribution helps administrators decide what performance 
can be supported cost effectively. 

Our focus in developing scc is to show how to system- 
atically exploit storage diversity; i.e., select among differ- 
ent physical media, local and remote storage, and various 
caching strategies. In the future, we plan to extend scc 
to tailor network configurations and choose among CPU 
types. Here, we assume the cluster network can deliver 
uniform bandwidth between all pairs of servers [4] and 
do not address incast-like scenarios [27] that arise due to 
limited packet buffers. Instead, we assume network stor- 
age access is limited only by network adapter speeds. 


3 Inputs to scc 


We now describe how we represent the three inputs to 
scc—SLA specifications, properties of cluster building 
blocks, and application models. Rather than model the 
intricate complexities of algorithms and hardware, scc 
captures aggregate high-level statistics that are relevant to 
application and hardware scaling behavior over a broad 
range of scenarios. Towards this end, we identify a key 
set of elements that comprise each of scc’s inputs and the 
corresponding attributes required to describe these ele- 
ments. Figure 2 depicts examples of scc’s three inputs; 
our implementation encodes them in XML. 


3.1 Specifying SLAs 


We consider throughput-based SLAs for two distinct ap- 
plication classes: batch and interactive; we defer sup- 
porting latency-based SLAs to future work. For batch 
applications, the SLA has two attributes: the job size 
and the required execution time. For example, for a 
MapReduce job, the SLA specifies the number of records 
to be processed and the total run time for doing so. scc is 
more applicable for provisioning a new set of VMs for 
every job than for provisioning a shared cluster used for 
running jobs with varying I/O and compute characteristics. 
For interactive applications run as services, each type of re- 
quest is associated with its own performance-based SLA 
that describes its required sustained processing rate. For 
example, in the case of a photo sharing Web service, the 
rates of photo uploads, photo views, and album views 
are each specified as a separate SLA. scc’s SLAs spec- 
ify peak rather than average case throughput. We discuss 
how scc accounts for temporal variation in Section 6.3. 


<sla task="photoview" rate="300"> </sla> 
<sla task="photoupload" rate="100"> </sla> 
<sla task="tagview" rate="100"> </sla> 

(a) 

<resources> 
<storage_unit name="7.2KDisk" capacity="500GB" bus="SAS" 
rateR="90MBps" gapR="8ms" rateW="90MBps" gapW="8ms" 
volatile="0" price="200" power="5W"> </storage_unit> 
<storage_unit name="SSD" capacity="32GB" bus="SAS" 
rateR="250MBps" gapR="0.4ms" rateW="80MBps" gapW="1ms" 
volatile="0" price="450" power="2.4W"> </storage_unit> 
<storage_unit name="DRAM" capacity="1GB" bus="DDR3-1333" 
rateR="12.8GBps" gapR="0.6ns" rateW="12.8GBps" gapW="0.6ns" 
volatile="1" price="25" power="3.5W"> </storage_unit> 
... additional storage units ... 
<cpu price="85" power="20W"> </cpu> 
<server name="HP DL380 G6" price="1400" cpus="4" BW="1Gbps"> 
<bus name="SAS" slots="4" BW="6Gbps"> </bus> 
<bus name="DDR3-1333" slots="12" BW="21.3GBps"> </bus> 
</server> 
... additional servers ... 
</resources> 

(b) 

<application> 
<dataset name="tables_repository" size="150GB" persistent="1" 
remote="1"> </dataset> 
<dataset name="hot_ratingsdata" size="1.6GB" persistent="*" 
remote="0"> </dataset> 
<dataset name="cold_ratingsdata" size="6.4GB" persistent="*" 
remote="0"> </dataset> 
... additional datasets ... 
<task name="worker" phase="exec" memory="1GB"> 
<io op="R" dataset="tables_repository" record_size="800MB" 
num_records="1" blocking="0"> </io> 
... additional I/O operations ... 
<compute time="2.2s" blocking="1"> </compute> 
<dependency task="queryprocessor" num_invocations="1" 
parallel="1" blocking="1"> </dependency> 
</task> 
<task name="queryprocessor" phase="exec" memory="200MB"> 
<io op="R" dataset="hot_ratingsdata" probability="0.8" 
num_records="40K" record_size="4KB" blocking="0"> </io> 
<io op="R" dataset="cold_ratingsdata" probability="0.2" 
num_records="40K" record_size="4KB" blocking="0"> </io> 
<compute time="0.65s" blocking="1"> </compute> 
</task> 
</application> 

(c) 


Figure 2: Example specifications of (a) SLAs for PhotoShare, 
(b) hardware resources, and (c) application behavior for a par- 
ticular deployment of ProductSearch. 


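As a minimal sketch of how SLA entries like those in Figure 2(a) could be read, the following Python snippet parses `<sla>` elements with the standard library. The wrapping `<slas>` root element and the `parse_slas` helper are assumptions made for illustration; scc's actual parser is not shown in the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: the <slas> root element is an assumption, added only
# so that the three entries form a single well-formed XML document.
SLA_XML = """
<slas>
  <sla task="photoview" rate="300"></sla>
  <sla task="photoupload" rate="100"></sla>
  <sla task="tagview" rate="100"></sla>
</slas>
"""

def parse_slas(xml_text):
    """Return a dict mapping each task name to its required peak rate."""
    root = ET.fromstring(xml_text)
    return {sla.get("task"): float(sla.get("rate")) for sla in root.iter("sla")}

slas = parse_slas(SLA_XML)
```

Keeping one rate per task mirrors the statement that each request type of an interactive application carries its own SLA.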
3.2 Cluster building blocks 


scc’s second input is a characterization of the set of build- 
ing blocks available for assembling the cluster. We ac- 
count for three types of elements—storage units, CPU 
cores, and servers. To ensure our approach is not tied to 
the characteristics of any particular technology, we em- 
ploy abstract features such as I/O bandwidth and number 
of processor slots as the attributes for these elements. Ta- 
ble 1 lists sample building blocks used in our evaluation. 


3.2.1 Storage 


Storage resources come in discrete units, e.g., 1 disk 
or 1 stick of DRAM. To differentiate between different 
kinds of storage technologies such as disk, SSDs and 
DRAM, we characterize each unit based on two prop- 
erties: capacity and I/O throughput. Capacity is simply 
the amount of available storage measured in bytes. Rep- 
resenting I/O throughput is more complex; we capture it 
with four attributes—the average rate at which I/O re- 
quests are served and the average latency gap between 
serving successive I/O requests, accounting for both sep- 
arately for reads and writes. The gap parameter captures 
overheads involved with non-sequential I/O, e.g., seeks 
on disks and block erasure on SSDs. We define read 
(write) gap for a particular storage device as the latency 
incurred on average between successive reads (writes) to 
random logical addresses on the device. The latency to 
serve a read (write) request for a chunk of size bytes is 
thus (size/rate + gap). We consider gap rather than the com- 
monly used IOPS metric because gap enables us to better 
capture the range of I/O performance regions from small 
to large records. For example, characterizing read per- 
formance on a 7.2K-RPM disk based on IOPS and rate 
works well for 4 KB and 10 MB reads, but fails to cap- 
ture the read throughput with 200 KB reads. In our eval- 
uation, we find that these four attributes—rate and gap 
for reads and writes—suffice to capture the I/O perfor- 
mance of multiple disk types and SSDs. Furthermore, 
we believe these attributes are expressive enough to cap- 
ture the characteristics of phase change memory (PCM) 
and other emerging storage technologies. 
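The rate-and-gap model can be illustrated with a short calculation. The sketch below uses the 7.2K-RPM disk read parameters from Figure 2(b) (90 MBps, 8 ms gap); the function names and the choice of decimal units (1 MB = 10^6 bytes) are assumptions for illustration.

```python
def io_latency(size_bytes, rate_bytes_per_s, gap_s):
    """Latency of one random I/O under the rate-plus-gap model: size/rate + gap."""
    return size_bytes / rate_bytes_per_s + gap_s

def throughput(size_bytes, rate_bytes_per_s, gap_s):
    """Sustained bytes/s when issuing back-to-back random I/Os of one size."""
    return size_bytes / io_latency(size_bytes, rate_bytes_per_s, gap_s)

# 7.2K-RPM disk read parameters as listed in Figure 2(b).
RATE = 90e6   # bytes per second (assuming 1 MB = 10**6 bytes)
GAP = 8e-3    # seconds

for size in (4e3, 200e3, 10e6):
    mbps = throughput(size, RATE, GAP) / 1e6
    print(f"{size / 1e3:>8.0f} KB reads: {mbps:5.1f} MB/s")
```

Under these parameters, 200 KB reads sustain roughly 19.6 MB/s. That is below both what a pure request-rate view would suggest (1/gap = 125 requests per second times 200 KB = 25 MB/s) and the 90 MBps sequential rate, which is the mid-sized region that the gap parameter is meant to capture.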

The application-visible performance of a storage 
medium is also influenced by how the chosen file system 
places data. For example, a disk can deliver significantly 
higher write throughput when written to in a log for- 
mat [28]. Therefore, when an application stores a dataset 
on a storage or file system, we measure I/O rates and gaps 
of each storage unit when using that system to read/write 
data. Further, for each storage unit, we consider two 
other attributes: storage persistence (i.e., whether it pro- 
vides non-volatile storage) and I/O bus type (e.g., SAS 
vs. PCIe). 


3.2.2 Servers and compute 


Servers impose constraints on how storage can be packed 
into a physical box. For each kind of server, we consider 
its memory capacity as well as the properties of its I/O 
controllers. For each I/O controller, we consider the total 
number of units it can support and its maximum avail- 
able I/O bandwidth. For example, a serial attached SCSI 
(SAS) controller permits up to 128 connected disks, yet 
supports a maximum I/O bandwidth of only 6 Gbps, less 
than the total sequential I/O throughput that can be ob- 
tained from 128 disks. Similarly, throughput for remote 
storage is limited by a server’s network interface speed. 
As our focus is on storage complexity in cluster archi- 
tectures, we consider only a single CPU type, ignoring 
trade-offs in compute per unit power [6, 11]. Instead, we 
vary the number of cores per server to extract the level of 
parallelism needed to maximize storage utilization. 
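The SAS controller example can be checked with simple arithmetic. The sketch below assumes decimal units and ignores protocol overhead; the per-disk rate of 90 MBps is taken from the 7.2K-RPM disk in Figure 2(b).

```python
DISK_RATE_MBPS = 90   # sequential rate of one 7.2K-RPM disk, from Figure 2(b)
MAX_DISKS = 128       # disks one SAS controller can attach, per the text
BUS_GBPS = 6          # SAS controller bandwidth, per the text

# Convert the bus bandwidth from gigabits to megabytes per second.
bus_mbytes_per_s = BUS_GBPS * 1000 / 8

# Aggregate sequential throughput if all attached disks stream at once.
aggregate_mbytes_per_s = DISK_RATE_MBPS * MAX_DISKS

# Number of streaming disks at which the bus becomes the bottleneck.
disks_to_saturate = bus_mbytes_per_s / DISK_RATE_MBPS
```

With these numbers the bus carries 750 MBps while 128 disks could supply 11,520 MBps, so under sequential I/O the controller bandwidth, not the slot count, becomes the binding constraint after about nine disks.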


3.2.3 Costs 


Finally, an additional attribute for every element in the 
resource specification is the amortized cost per hard- 
ware unit including both capital and operational outlays. 
In our current implementation, the latter subsumes en- 
ergy costs, ignoring data center costs and administrator 
salaries, and we consider total cluster cost to be a linear 
sum of individual components, which may not necessar- 
ily be true for large quantities. We leave for future work 
discounting the growth of expenses with cluster size and 
accounting for increased operational costs with a higher 
diversity of server configurations in the cluster. 
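A minimal sketch of this cost model follows, assuming a three-year amortization period as in the caption of Table 1. The electricity price of 0.10 USD/kWh and the helper names are assumptions for illustration, not values taken from the paper.

```python
HOURS_PER_YEAR = 24 * 365

def unit_cost(price_usd, power_watts, usd_per_kwh, years=3):
    """Purchase price plus energy cost over the amortization period."""
    energy_kwh = power_watts * HOURS_PER_YEAR * years / 1000
    return price_usd + energy_kwh * usd_per_kwh

def cluster_cost(components):
    """Linear sum over (unit_cost, quantity) pairs, as the text assumes."""
    return sum(cost * qty for cost, qty in components)

# 7.2K disk from Figure 2(b): price 200, power 5 W. The electricity price
# of 0.10 USD/kWh is an assumption for illustration only.
disk = unit_cost(200, 5, usd_per_kwh=0.10)

# One server (price 1400 in Figure 2(b), energy ignored here) with four disks.
total = cluster_cost([(disk, 4), (1400, 1)])
```

Because the total is a plain sum, volume discounts and the operational overhead of mixing server types are not represented, matching the limitations stated above.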


3.3 Characterizing applications 


Our characterization of an application accounts for two 
aspects: its implementation and the workload in its 
planned deployment. However, unlike previous attempts 
at formally specifying workloads [34], simply account- 
ing for storage capacity needs and the application’s 
stream of I/O operations does not suffice for our pur- 
pose. Instead, to capture an application’s implementa- 
tion, we first ask the application’s developer to describe 
its decomposition into compute and storage components, 
and the interaction between them. For example, Fig- 
ure 3 depicts the components, and the interaction be- 
tween them, for one of the three applications we con- 
sider later in our evaluation—a photo sharing Web ser- 
vice, PhotoShare. Though our approach places the onus 
on application developers to go through the process of 
formally specifying the components of their application, 
an application’s specification is reusable across deploy- 
ments. Some of the characteristics of several applications 
are already captured today [23, 24]. 

Second, we enable those who deploy an application 
to annotate the specification of the application’s archi- 
tecture with properties of the expected workload in their 
deployment. To do so, we require that the compute 
and I/O characteristics of an application’s components, 
when subjected to the target workload, be determined 
by running small-scale application benchmarks. Extract- 
ing these properties requires tracing the application’s 
execution, which is now standard practice in resource-intensive 
performance-critical applications. In the absence of 
built-in tracing support, systems like Magpie [8] can be 
leveraged. 


Figure 3: Interaction between tasks and datasets in example 
application PhotoShare. Edges between tasks and datasets rep- 
resent I/O with direction differentiating input and output. Dot- 
ted edges indicate task dependencies. 


3.3.1 Tasks and datasets 


scc’s application specification separates the applica- 
tion’s compute and storage requirements into tasks and 
datasets. A task is a specific application functional unit; 
all threads/processes that perform the same function to- 
gether constitute a single task. A dataset is a collection of 
records of the same type with similar I/O access patterns. 

Execution of tasks. To account for how compute time 
and I/O wait time are distributed across a task’s execu- 
tion, we represent each task by its execution path; dif- 
ferent tasks in an application will have different execu- 
tion paths. A task in an interactive application executes 
its execution path for each incoming request, whereas in 
batch applications, a task’s execution path is executed as 
many times as necessary to consume its input. Further, 
since batch jobs can go through multiple phases of exe- 
cution, we require the application developer to tag each 
task with the phase to which it belongs. The cluster can 
thus be provisioned to support the maximal resource re- 
quirement across phases. 

We characterize the execution path of a task as a se- 
quence of three types of operations—compute, I/O, and 
invocations of other tasks. Each of these can be marked 
as either blocking or non-blocking. Compute operations 
are characterized by the amount of time spent perform- 
ing computation on a particular type of CPU. While this 
value can of course vary, we have found that a represen- 
tative average is sufficient to inform scc; we show later 
in Section 6.1 that scc can help evaluate the sensitivity 
of its output to the input values. I/O operations are at- 
tributed with the dataset on which the operation is being 
performed and whether it is a read or write operation. 
Similarly, every task dependency is annotated with the 
invoked task. 

The operations in a task’s execution path may not be 
completely deterministic. For example, an I/O operation 
may hit the cache in some cases but not all, or a remote 
task may need to be invoked only based on the results of 
prior task invocations. To capture such non-determinism, 
every operation has an additional attribute—the proba- 
bility of its execution. This, for example, enables us to 
capture developer knowledge of typical working set sizes 
for individual datasets and the hit rate on the working set. 
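One way to picture an execution path with per-operation probabilities is as a list of annotated operations. The sketch below is an illustration under stated assumptions: the `Operation` class, its `time_s` field, and the rule of summing only blocking operations are hypothetical simplifications, since this section does not spell out how scc combines blocking and non-blocking operations.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Operation:
    kind: str                  # "compute", "io", or "dependency"
    time_s: float              # average time of one execution of the operation
    blocking: bool             # whether later operations wait for this one
    probability: float = 1.0   # chance the operation runs on a given pass

def expected_blocking_time(path: List[Operation]) -> float:
    """Expected time per pass spent in blocking operations only."""
    return sum(op.probability * op.time_s for op in path if op.blocking)

# Hypothetical path: a cache lookup that misses 20% of the time and then
# falls through to a slower blocking read, followed by blocking compute.
path = [
    Operation("io", time_s=0.001, blocking=True),
    Operation("io", time_s=0.010, blocking=True, probability=0.2),
    Operation("compute", time_s=0.050, blocking=True),
    Operation("io", time_s=0.020, blocking=False),  # e.g., an asynchronous write
]

expected = expected_blocking_time(path)
```

Weighting each operation by its probability is what lets a single average path stand in for the cache-hit and cache-miss cases described above.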

Lastly, we also require that each task node be tagged 
with its memory requirements. While some applications 
may use all available memory and garbage collect on de- 
mand, we consider required memory to be the amount 
necessary to maintain performance. Note that this spec- 
ifies memory that scc must allocate for computation be- 
yond any additional DRAM scc provisions as RAM disks 
to store datasets. 

Representing datasets. Next, we account for datasets 
in terms of their I/O bandwidth and capacity require- 
ments. The I/O requirements from a dataset are deter- 
mined by all the I/O operations performed on it, across 
the execution paths of all tasks. We ask that each I/O 
operation be tagged with three attributes—the number 
of records read or written, the number of bytes in each 
record, and whether records are read in parallel. The 
last of these three properties can be specified by the ap- 
plication developer, while the other two depend on the 
workload for which the application is being deployed. 
Again, we find that average values suffice for our tar- 
get throughput-based SLAs. Describing I/O in terms of 
records accounts for the overhead seen between succes- 
sive read/write operations on storage media such as disks 
and SSDs, e.g., from disk seeks. We similarly annotate 
task dependencies with three attributes—the number of 
invocations being performed, whether they are in par- 
allel, and whether the whole dependency is blocking or 
non-blocking. 
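The record-level attributes translate into an aggregate load on each dataset. The following sketch assumes that "40K" means 40,000 records and "4KB" means 4,000 bytes, and uses a hypothetical task rate of 10 invocations per second; the helper names are likewise illustrative.

```python
def op_bytes(num_records, record_size_bytes):
    """Bytes moved by one execution of an I/O operation."""
    return num_records * record_size_bytes

def dataset_load(ops, task_rate_per_s):
    """Average bytes/s and records/s a dataset sees at the given task rate.

    ops: iterable of (probability, num_records, record_size_bytes) tuples,
    one per I/O operation that touches the dataset.
    """
    ops = list(ops)
    bytes_per_s = sum(p * op_bytes(n, s) for p, n, s in ops) * task_rate_per_s
    records_per_s = sum(p * n for p, n, _ in ops) * task_rate_per_s
    return bytes_per_s, records_per_s

# Read of hot_ratingsdata by queryprocessor in Figure 2(c): probability 0.8,
# 40K records of 4 KB each. The rate of 10 invocations/s is an assumption.
load = dataset_load([(0.8, 40_000, 4_000)], task_rate_per_s=10)
```

Tracking records per second separately from bytes per second matters because each record access pays the per-request gap on media such as disks and SSDs.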

Lastly, we account for a dataset’s capacity require- 
ments by requiring that it be tagged with three additional 
attributes: its size, whether it must be persistent, and 
whether the dataset is local or remote. This last attribute 
differentiates between data assumed in the application’s 
implementation to be on a storage unit local to the task 
accessing it as opposed to data that may be stored on a 
storage unit on a different machine in the cluster. Though 
a remote file can be made to appear local by use of sys- 
tems such as NFS, we capture the application developer’s 
assumption of local storage, since remote access leads to 
higher access latencies. scc leverages this distinction in 
two ways. For a remote dataset, scc explicitly accounts 
for network load resulting from I/O requests and some 
CPU requirements for the machines hosting the dataset. 
Conversely, task-local storage constrains the amount of 
parallelism available on a single machine due to the stor- 
age bandwidth and number of storage unit slots available 
on the node. 

Figure 2(c) presents an example (for another of the 
applications we use in our evaluation, ProductSearch, 
a product search Web service) of the precise format in 
which such an application characterization is specified 
as input to scc. 


4 Implementation of scc 


Next, we describe how scc processes its inputs to gener- 
ate cost-effective cluster configurations. 


4.1 Overview 


scc determines the cost versus SLA distribution for a 
given application deployment by considering the config- 
uration for each point in the distribution independently. 
To compute the cluster configuration for a target SLA, 
scc needs to answer two questions. First, it needs to de- 
termine the architecture of the cluster—for each dataset 
of the application, it must determine the type of media on 
which the dataset should be stored and how to pack the 
storage units into servers. This packing is constrained by 
the number and location of CPUs available to assign to 
the compute tasks that access each dataset. Second, scc 
needs to identify the scale at which this architecture must 
be instantiated to meet the SLA—scale is determined by 
the number of servers, storage units, and CPUs, as well 
as the level of parallelism of each application task. 
Guiding Principles. Two key principles help scc 
identify the right cluster configuration. First, the archi- 
tecture and scale for every application component can be 
determined independently when all operations are per- 
formed asynchronously, but not when some operations 
are synchronous. The SLA for any task only specifies 
the rate at which a task’s execution path must run. In 
the typical case where a task’s execution path contains 
some operations that block others, scc needs to deter- 
mine the “division of labor” across these operations that 
minimizes cost. For example, in a task that reads from 
an input dataset and then writes to an output dataset, in 
order to meet the task’s SLA, it may suffice to provision 
fast storage for any one of the two datasets; provision- 
ing fast storage for both datasets may unnecessarily re- 
sult in higher cost due to storage capacity requirements, 
whereas slow storage for both may incur higher costs 
in satisfying I/O throughput needs. Hence, scc jointly 
determines resource requirements across all application 
components. 


Configuration state: S = (S_1, ..., S_n), where 
    S_i = storage type assigned to the i-th dataset 
for every remote dataset d_i 
    compute U_i = no. of units of S_i to meet capacity and 
                  I/O needs from d_i 
for every task t_j 
    R_j = average runtime of t_j 
    P_j (parallelism of task t_j) = SLA(t_j) x R_j 
    for every dataset d_i local to t_j, 
        compute no. of units of S_i to meet capacity and 
        I/O needs from d_i for one instance of t_j 
Linear integer program to choose servers 
    Variables: 
        1. booleans for whether k-th server is of j-th type 
        2. ∀ remote dataset d_i, no. of units of S_i in k-th server 
        3. ∀ task t_j, no. of instances on k-th server 
    Constraints: 
        Per-server constraints: 
            1. On each I/O controller, (no. of storage units ≤ no. of 
               slots) and (I/O throughput ≤ bus bandwidth) 
            2. (I/O throughput on remote datasets and local datasets 
               accessed remotely) ≤ network bandwidth 
            3. no. of CPUs ≤ no. of CPU slots 
        Per-dataset and per-task constraints: 
            1. ∀ dataset d_i, (no. of units across all servers = U_i) 
            2. ∀ task t_j, (no. of instances across all servers = P_j) 
    Objective: 
        Minimize cost of (servers + storage units + CPUs) 


Figure 4: Summary of scc’s procedure for determining a cost- 
effective cluster configuration that satisfies target SLAs, given 
a particular assignment of storage types to datasets. 
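The first two steps of the procedure in Figure 4 can be sketched as follows. The relation P_j = SLA(t_j) x R_j is an instance of Little's law. Rounding up to whole instances, taking the maximum of the capacity-driven and I/O-driven unit counts, and the `io_time_per_s` abstraction are assumptions of this sketch rather than details given in the figure.

```python
import math

def task_parallelism(sla_rate_per_s, avg_runtime_s):
    """P_j = SLA(t_j) x R_j: concurrent instances needed to sustain the rate."""
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(sla_rate_per_s * avg_runtime_s, 9))

def units_needed(size_bytes, io_time_per_s, unit_capacity_bytes):
    """Units of one storage type needed to cover both capacity and I/O.

    io_time_per_s is the device-seconds of I/O the dataset demands per
    second of wall-clock time; each unit supplies at most one.
    """
    by_capacity = math.ceil(size_bytes / unit_capacity_bytes)
    by_io = math.ceil(io_time_per_s)
    return max(by_capacity, by_io)

# A task that must run 300 times per second and takes 0.25 s on average
# needs 75 concurrent instances.
p = task_parallelism(300, 0.25)

# A 150 GB dataset on 32 GB SSDs is capacity-bound at 5 units when its
# I/O demand fits within 2.3 device-seconds per second.
u = units_needed(150e9, 2.3, 32e9)
```

The same dataset can flip from capacity-bound to I/O-bound as load grows, which is why the unit count is recomputed for every candidate storage type and SLA value.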


Second, since scc is provisioning for peak load, it pre- 
vents over-provisioning by ensuring that at least one re- 
source is bottlenecked on every server at peak load. (If 
the application provider desires to run the cluster at lower 
peak utilization, that can be specified as input.) Based on 
our characterization of hardware, there are four possible 
bottlenecks on each server—1) the number of slots or 
2) the bandwidth on an I/O controller, 3) the number of 
CPU cores, or 4) network bandwidth. 


Algorithm. Driven by the need for joint optimization 
across components, scc represents each point in the state 
space of configurations by the assignment of storage unit 
types to datasets. As a result, if S is the number of stor- 
age choices and D is the number of datasets, scc has to 
search through a space of O(S^D) configurations; for each
dataset, scc can choose any one of the S storage options. 


In cases where the configuration space is too large to
perform an exhaustive search, scc performs a repeated
gradient descent search. We start with a randomly chosen
configuration. In each step, we consider all neighboring
configurations (those that differ in exactly one dataset’s
storage-type assignment) and move to the configuration
that still meets the SLA with the maximum decrease in
cost. We repeat this step until we find a configuration
where all neighbors have higher cost. Since gradient
descent can lead to a local minimum, we repeat this
procedure multiple times with different randomly chosen
initial configurations and settle on the minimum-cost
output across the multiple attempts. In our evaluation, we
have found that repeating the gradient descent 10 times is
typically sufficient to find a solution close to the global
minimum. Therefore, even when determining the
configuration to satisfy workloads of tens of thousands of
queries per second, scc’s running time for any particular
SLA is within a minute.
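
The repeated search described above can be sketched as follows. This is a minimal illustration, not scc's implementation: the `cost` and `meets_sla` callbacks, the tuple encoding of a configuration, and the choice to skip infeasible random starting points are all assumptions made for the example.

```python
import random

# A configuration is a tuple: config[i] is the index of the storage type
# assigned to dataset i (hypothetical encoding for this sketch).


def neighbors(config, num_types):
    """Yield every configuration that differs in exactly one dataset's storage type."""
    for i, current in enumerate(config):
        for t in range(num_types):
            if t != current:
                yield config[:i] + (t,) + config[i + 1:]


def descend(start, num_types, cost, meets_sla):
    """Move to the cheapest SLA-satisfying neighbor until no neighbor is cheaper."""
    current = start
    while True:
        feasible = [n for n in neighbors(current, num_types) if meets_sla(n)]
        best = min(feasible, key=cost, default=None)
        if best is None or cost(best) >= cost(current):
            return current
        current = best


def repeated_descent(num_datasets, num_types, cost, meets_sla, restarts=10, seed=0):
    """Run the descent from several random starts and keep the cheapest result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        start = tuple(rng.randrange(num_types) for _ in range(num_datasets))
        if not meets_sla(start):
            continue  # assumption: infeasible starting points are simply skipped
        local = descend(start, num_types, cost, meets_sla)
        if best is None or cost(local) < cost(best):
            best = local
    return best
```

In scc, evaluating `cost` for one configuration corresponds to the procedure summarized in Figure 4; here it is left as an opaque callback.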

At the heart of scc’s search of the configuration space 
is a procedure—summarized in Figure 4—that, given 
any particular assignment of storage types to datasets, 
determines a cost-effective set of resources to meet the 
target SLAs. In this procedure, scc first determines for 
each remote dataset, i.e., not local to any task, the number
of storage units required of the type assigned to the
dataset in the configuration state. Second, scc determines 
the number of CPUs required by every task and the num- 
ber of storage units of the assigned type needed by the 
task’s local datasets. Finally, it determines the types of 
servers and number of each kind required to minimize 
overall cluster cost. We describe these three steps using 
examples from illustrative applications. 


4.2 Resources for datasets 


A dataset’s storage resources need to satisfy two require- 
ments: capacity and I/O throughput. To determine the 
cheapest storage solution that satisfies both, scc com- 
putes the number of storage units required to satisfy each 
requirement independently and chooses the maximum of 
the two. When the former (latter) is more expensive, 
we call the dataset capacity (I/O) bound. A capacity- 
bound dataset requires storage equal to the dataset’s size 
irrespective of the medium used. Determining the storage
required by an I/O-bound dataset is more involved.
Though the total capacity of the storage units allocated 
to the dataset need only be equal to the dataset’s size, 
we may need more units—under-utilizing the capacity 
on each of them—to meet throughput demands. 

We compute I/O requirements as follows. As de- 
scribed in Section 3.3, the application characterization 
specifies the record size and the number of records 
read/written in every I/O operation. scc computes the 
overall number of I/O operations that a particular storage 
unit can support based on its rate and gap parameters. 
The SLA combined with the probability attributed to an 
I/O operation fully specifies the required frequency of the 
operation, which in turn determines the number of stor- 
age units required to deliver the performance in parallel. 
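
As a rough sketch of this calculation, the snippet below models one storage unit by its rate and gap parameters and derives the number of units needed for a given request rate. The function names and the decimal definitions of KB and MB are assumptions for illustration; the numbers are those of the PhotoShare photo-view example in this section.

```python
import math

KB, MB = 1_000, 1_000_000  # assumption: decimal units


def effective_throughput(request_bytes, rate_bytes_per_s, gap_s):
    """Bytes/s one unit sustains when each request pays the gap plus its transfer time."""
    return request_bytes / (request_bytes / rate_bytes_per_s + gap_s)


def units_for_io(request_bytes, requests_per_s, rate_bytes_per_s, gap_s):
    """Number of units that must work in parallel to deliver the required request rate."""
    demand = request_bytes * requests_per_s
    per_unit = effective_throughput(request_bytes, rate_bytes_per_s, gap_s)
    return math.ceil(demand / per_unit)


def units_for_dataset(capacity_units, io_units):
    """A dataset needs enough units to satisfy both requirements, i.e., the maximum."""
    return max(capacity_units, io_units)


# 200 KB reads at 1000 views/s on a disk with a 150 MBps read rate and 3.5 ms read gap.
photo_io_units = units_for_io(200 * KB, 1000, 150 * MB, 3.5e-3)  # 5 units
```

The per-unit throughput evaluates to roughly 41 MBps, which the text rounds to 40 MBps; either value yields 5 units.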


For example, when serving requests to view photos in
PhotoShare, one photo of size 200 KB on average is read
from the photos dataset on every photo view. If the photos
dataset were assigned to a 15K-RPM disk (Table 1),
which offers a read rate of 150 MBps and a read gap
of 3.5 ms, it will be able to serve 200 KB-sized reads at
the throughput of

$$\frac{200\,\text{KB}}{\frac{200\,\text{KB}}{150\,\text{MBps}} + 3.5\,\text{ms}} \approx 40\,\text{MBps}.$$

Therefore, if the SLA specifies 1000 photo views per second,

$$\frac{200\,\text{KB} \times 1000/\text{s}}{40\,\text{MBps}} = 5$$

units of 15K-RPM disks are required to satisfy the I/O
throughput requirement.


4.2.1 Task phases


Not all tasks in an application execute concurrently, e.g., 
the Map and Reduce tasks run in different phases of a 
MapReduce job. Since datasets are subject to I/O opera- 
tions only from tasks executing in a particular phase, scc 
computes the storage needed to meet I/O requirements in 
each phase independently. The storage requirements for 
a dataset during a particular execution phase are com- 
puted as the sum of storage needs across all the I/O op- 
erations made on the dataset by the tasks that run in that 
phase. scc computes the overall I/O-mandated storage 
requirement as the maximum over all phases. 
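
A minimal sketch of this rule, assuming each I/O operation has already been translated into a (possibly fractional) number of storage units; the phase labels and numbers are hypothetical.

```python
from collections import defaultdict


def io_units_over_phases(operations):
    """operations: iterable of (phase, units_needed) pairs for one dataset.

    Needs are summed within a phase, because those operations run concurrently,
    and the maximum is taken across phases, because phases do not overlap.
    """
    per_phase = defaultdict(float)
    for phase, units in operations:
        per_phase[phase] += units
    return max(per_phase.values(), default=0.0)


# Hypothetical dataset touched by two Map-phase operations and one Reduce-phase operation.
example = [("map", 2.5), ("map", 1.0), ("reduce", 3.0)]
needed = io_units_over_phases(example)  # 3.5, set by the Map phase
```

The caller would round the result up to a whole number of units before comparing it against the capacity requirement.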


4.2.2 Caching for higher I/O 


When a dataset is I/O-bound, storing it across units of 
a single type may not always be the cheapest solution. 
I/O throughput of persistent datasets can be improved by 
introducing a second type of storage unit as a caching 
layer. For example, when considering a single storage 
type to service the entire load, the SSD is the most cost- 
effective option for the tags dataset in the PhotoShare ap- 
plication. However, a cheaper solution is to store the per- 
sistent copy of the tags on a 7.2K-RPM disk and to serve 
reads from a cached copy in DRAM. 

scc assumes write-through caching. Persistent storage 
units handle all writes and maintain a persistent copy. 
Units of another type, with higher I/O rates, handle all 
reads. To ensure durability, every write is committed 
to both copies and, by default, scc provisions enough
storage to cache the entire dataset. However, devel- 
oper knowledge of the application’s working set size— 
encoded into the application specification as different ca- 
pacity requirements for the dataset and for the cache— 
can also be used to determine what fraction of the dataset 
is to be cached. To evaluate whether such a solution is 
cost effective, scc computes the costs of both copies of 
the dataset separately and computes their sum. 


4.3 Task resources


scc next determines the resource requirements of each 
compute task in three steps. First, it determines the CPU 
utilization of the task. Second, it computes the degree of 
parallelism (i.e., the number of threads/processes of the
task) required to meet the SLA. Finally, it determines
the number of storage units required per instance of the 
task for each of the task’s local datasets. 


A task’s CPU utilization is the fraction of its run time 
spent performing computation. scc translates a task’s 
CPU utilization into the corresponding CPU resources 
required by computing the level of parallelism required 
to meet the SLA: if a task’s execution path is to be exe- 
cuted with frequency F and the task’s average run time 
is R, then (F × R) instances of the task are required. The
value of F for a task is computed from the SLA for that
task and other tasks that depend on it; R is computed by 
appropriately summing up the times for compute, I/O, 
and task invocation operations in the task’s execution 
path, taking into account, for each operation, its prob- 
ability and whether it is blocking or non-blocking. 
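
The parallelism computation can be illustrated with a small sketch. The `Operation` record is hypothetical, and treating non-blocking operations as contributing nothing to the run time is a simplifying assumption; the text states only that blocking behavior is taken into account.

```python
import math
from dataclasses import dataclass


@dataclass
class Operation:
    duration_s: float          # time taken when the operation executes
    probability: float = 1.0   # chance that the operation is on the execution path
    blocking: bool = True      # whether the task waits for the operation


def average_runtime(operations):
    """Expected run time R: probability-weighted sum over blocking operations.

    Assumption: non-blocking operations fully overlap with other work.
    """
    return sum(op.duration_s * op.probability for op in operations if op.blocking)


def required_instances(frequency_per_s, runtime_s):
    """Instances needed so that F executions per second, each lasting R, can run."""
    # The small tolerance guards against floating-point noise just above an integer.
    return math.ceil(frequency_per_s * runtime_s - 1e-9)


ops = [Operation(0.5), Operation(0.25, probability=0.5), Operation(1.0, blocking=False)]
R = average_runtime(ops)          # 0.5 + 0.125 = 0.625 s
P = required_instances(100, R)    # 100/s * 0.625 s, rounded up
```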


scc calculates each task’s storage requirements for its 
local datasets based on capacity and I/O throughput re- 
quirements. scc also computes the task’s memory re- 
quirements and the network bandwidth needed for I/O 
accesses to remote storage. scc determines each of these 
three requirements—local storage, memory, and network 
bandwidth—per instance of the task and linearly extrap- 
olates to a target level of parallelism. 


4.4 Optimizing server costs 


Finally, scc optimizes cluster cost by minimizing the 
cost of required servers. Determining the servers re- 
quired to host storage and CPU resources reduces to 
the multi-dimensional vector bin packing problem [12]. 
Each server type is associated with a cost and a vector 
of resource limits, such as the I/O bandwidth of each I/O 
controller and the maximum number of CPUs that the 
server can accommodate. Respecting these resource lim- 
its, CPUs and storage units required by tasks and datasets 
must be placed across servers, while minimizing total 
cost. scc solves this bin-packing problem with a linear 
integer program. 
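
To give a feel for the optimization, the sketch below replaces the linear integer program with a brute-force search over server counts. It is a deliberately simplified stand-in: it checks only aggregate disk-slot and CPU-slot capacity, ignores per-controller and network bandwidth limits, and uses hypothetical server types and prices.

```python
from itertools import product


def cheapest_servers(server_types, disks_needed, cpus_needed, max_per_type=8):
    """Return (cost, counts) for the cheapest mix of servers with enough slots.

    server_types maps a name to {"cost", "disk_slots", "cpu_slots"}.
    """
    names = list(server_types)
    best = None
    for counts in product(range(max_per_type + 1), repeat=len(names)):
        chosen = dict(zip(names, counts))
        disk_slots = sum(n * server_types[s]["disk_slots"] for s, n in chosen.items())
        cpu_slots = sum(n * server_types[s]["cpu_slots"] for s, n in chosen.items())
        if disk_slots < disks_needed or cpu_slots < cpus_needed:
            continue  # this mix cannot host the required resources
        cost = sum(n * server_types[s]["cost"] for s, n in chosen.items())
        if best is None or cost < best[0]:
            best = (cost, chosen)
    return best


# Hypothetical server types; the prices are placeholders, not values from Table 1.
types = {
    "small": {"cost": 100, "disk_slots": 4, "cpu_slots": 4},
    "big": {"cost": 250, "disk_slots": 12, "cpu_slots": 8},
}
plan = cheapest_servers(types, disks_needed=10, cpus_needed=6)
```

Exhaustive enumeration grows exponentially with the number of server types, which is one reason an integer-programming formulation is preferable at realistic scales.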


5 Evaluation 


Next, we demonstrate that scc achieves the right cost ver- 
sus performance tradeoff. Unfortunately, it is difficult to 
select appropriate comparisons. Though there exists a 
large body of work on capacity planning [22], all of it re- 
volves around the question: “Given a cluster architecture 
for an application, how many servers of each type in the 
architecture are necessary?” In contrast, scc minimizes 
cost by determining not only the right scale, but also the 
architecture most suited for a given application deploy- 
ment. Moreover, conversations with major infrastructure 
providers reveal that existing approaches for provision- 
ing cluster applications used in practice are ad-hoc—the 
primary motivation for our work.


5.1 Methodology


We apply scc to three distributed applications with 
disparate workload characteristics to identify the cost- 
versus-SLA tradeoff in each case. To keep the discus- 
sion simple, we fix capacity requirements while vary- 
ing the SLA. For each application, we validate the cost- 
effectiveness of scc’s output for one particular target 
SLA. Though scc readily outputs cluster configurations 
on the scale of tens of thousands of servers, we focus on 
smaller scales for validation so that we can instantiate the 
configurations with hardware we have on hand. Note that 
even at the scale of a few servers, the combination of type 
and quantity for storage, compute, and servers results in 
a very large configuration space. For example, with 5 
servers of type Server1, over 10!* cluster configurations
are feasible using the building blocks in Table 1. 

In the absence of prior approaches for principled de- 
termination of cluster architectures, our evaluation com- 
pares configurations output by scc with all possible alter- 
native assignments of datasets to storage types; for each 
alternative, we consider those quantities of hardware re- 
sources to make cost comparable to scc. Here, we present 
results from alternate architectures that come closest to 
matching scc with respect to satisfaction of SLAs. In 
some cases, we also consider alternative architectures at 
the scale required to meet input SLAs and show that they 
incur higher costs than scc. For each experiment, we 
physically provision clusters composed of the building 
blocks provided as input to scc. 

Table 1 summarizes the resources provided as input to 
scc, represented formally as in Figure 2(b). We construct 
our specification for cluster building blocks based on HP 
ProLiant DL380 G6 servers interconnected by a Gigabit 
Ethernet network. In each server (Server1), we consider
the resource limitations to be one quad-core Intel Xeon 
processor, four SAS slots, and up to 12 GB of DRAM. 
Each of the SAS slots can support a 7.2K-RPM disk, a 
15K-RPM disk, or an Intel SSD. To evaluate the perfor- 
mance of a given configuration, we turn off CPU cores 
and/or use only a subset of the SAS and DIMM slots. 

For each of the resources, we consider the cost to be 
the amount we paid, excluding support, plus energy costs 
computed based on power usage numbers from product 
data sheets (we assume $0.10/kWh over a three-year
deployment). Though the power drawn by any unit can vary
from its specification, we study the robustness of our re- 
sults (Section 6.1) and find that they remain unchanged 
even if energy costs increase by a factor of two. 
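
The cost model for a single resource can be written out as a short sketch. The electricity price and deployment length come from the text; the assumption of continuous operation at the data-sheet power draw, and the 100 W example figure, are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365


def energy_cost(power_watts, years=3.0, price_per_kwh=0.10):
    """Energy cost in dollars, assuming the unit draws power_watts continuously."""
    return power_watts / 1000.0 * HOURS_PER_YEAR * years * price_per_kwh


def resource_cost(purchase_price, power_watts):
    """Total cost of one resource: purchase price plus energy over the deployment."""
    return purchase_price + energy_cost(power_watts)


# A hypothetical unit drawing 100 W costs about $262.80 in energy over three years.
example_energy = energy_cost(100.0)
```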


5.2 Photo sharing 


Our first application, PhotoShare, is an interactive photo 
sharing application. It allows users to upload tagged 
photos, to view thumbnails for photos associated with 
a given tag, and to view the photos. PhotoShare is a
Figure 5: Validation of cluster output by scc for particular
SLA values in the three application cases. 


C++ FastCGI application hosted on lighttpd webservers. 
Uploaded images are thumbnailed and stored, whereas 
tag updates are made via RPCs. Data is kept in a dis- 
tributed log-based key-value storage system. Image, tag, 
and thumbnail views translate to fetches from the store. 
The three SLA metrics are the simultaneous rates for up- 
loading photos, viewing photos, and viewing thumbnails 
associated with tags. Our input workload has, on aver- 
age, 200-KB images that convert to 4-KB thumbnails, 
and an average of 10 photos/tag and 10 tags/photo. 


We apply scc to study the cost as a function of the SLA by
fixing the ratio of the rates for uploads, photo views, and
tag views at 1:3:1. Figure 6 shows this cost distribu-
Figure 6: Cost versus SLA distribution output by scc for PhotoShare.
Note log scale on y axis.


tion for a range of SLA values. Perhaps surprisingly, 
no huge spikes are observed in this distribution; this is 
because scc balances costs across the kind of storage, 
the number of CPUs, and the number of machines provi- 
sioned. Rather than adding more machines of the same 
type, the cluster architecture transitions to faster storage 
as the SLA becomes more stringent, with transitions in 
storage type for different datasets seen at different SLA 
values. Table 2 highlights these transitions. Note that the 
quantity in which different types of resources are provi- 
sioned varies within each architecture regime specified 
by every row in the table. 


We further compare the cost output by scc with the 
cost associated with a scale-out approach. We compare 
the scc configuration to the cases where the building 
block is based around: 1) storage servers with four 7.2K- 
RPM disks (the cost-optimal storage type for all datasets 
at the lowest SLA), and 2) servers with four 15K-RPM 
disks. In either case, more storage servers are added as 
the required rates increase. Figure 6 shows that the costs 
in both cases are significantly greater than with scc, in- 
curring between 3 and 4.5 times more cost (note the loga- 
rithmic y axis). Thus, simply scaling out a homogeneous 
configuration that is cost-effective at low loads can result 
in significant cost inflation at higher loads. 


To verify the performance of scc’s suggested configu- 
ration, we focus on one particular SLA: 100 uploads/s, 
300 photo views/s, and 100 tag views/s. The fraction of 
the SLA satisfied is the minimum fraction of sustained 
request rates across uploads, photo views, and tag views. 
scc determines the following cluster configuration for 
this SLA: one machine, with 4 CPU cores and 2 GB of 
DRAM hosts the webserver; a second machine stores the 
photos across four 15K-RPM disks; and a third machine 
hosts one SSD for thumbnails, and 1 GB of DRAM and 
one 7.2K-RPM disk for tags. Each of the two storage 
machines has 2 CPU cores and an additional 1 GB of
DRAM, as required by the key-value storage system. 
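
The "fraction of the SLA satisfied" metric defined above reduces to a one-line computation. In this sketch the target rates are those of the SLA in the text, while the sustained rates are hypothetical measurements.

```python
def sla_fraction(target, sustained):
    """Minimum, over request types, of sustained rate divided by target rate."""
    return min(sustained[kind] / target[kind] for kind in target)


target = {"uploads": 100, "photo_views": 300, "tag_views": 100}
# Hypothetical measured rates (requests per second), for illustration only.
sustained = {"uploads": 110, "photo_views": 285, "tag_views": 100}
fraction = sla_fraction(target, sustained)  # limited by photo views
```

A value of at least 1 means every request type meets its target; values above 1 indicate over-provisioning.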


Figure 5(a) shows that this configuration meets
Table 2: Different regimes based on SLA requirements in the
cost-effective architecture for PhotoShare. 


the SLA; in fact, the configuration is slightly over-provisioned.
It also shows the configuration is near a
minimum: removing a core from the webserver (Alt1),
replacing the thumbnail’s SSD with a cheaper 15K-RPM
disk (Alt2), removing one of the photo disks (Alt3), or
replacing the thumbnail’s SSD with two 7.2K-RPM disks
(Alt4) all result in SLA misses. A scale-out architecture
extending Alt4 with more 7.2K-RPM drives (Alt5) incurs
30% higher cost to meet the SLA.


5.3 Product search


Our second application is a multi-merchant product 
search and comparison service, which we call Product- 
Search. We store product tables, which include product 
serial numbers, types, descriptions, and costs, along with 
product type field indices in a Hadoop Distributed File 
System (HDFS). In addition, user rating data is stored 
in a separate database table. Worker processes running 
across the cluster process queries for the cheapest prod- 
uct of a given type with a minimum user-specified rating. 
Each worker maintains a local copy of the ratings table 
as well as an index on the product serial number field; 
the ratings table and index are hence specified as local
datasets in the application’s specification. To execute a 
query, a worker fetches the relevant product table and 
index from HDFS and then performs a join with the rat- 
ings table on the product serial number field, selecting 
for rows with the specified product type. 

In our deployment, we build product tables with an av- 
erage of 200K products, each with an average of 200 rat- 
ings. This translates to 8 GB for the ratings and roughly 
800 MB for each product table. The SLA for this appli- 
cation specifies the required query rate. 

We apply scc to determine system cost as a function 
of the SLA value. As with PhotoShare, the architec- 
ture of the cost-effective cluster changes significantly 
across different regimes of the SLA. At low query rates, 
scc recommends disks for both HDFS and local stor- 
age of workers. As the required query rate increases, 
scc transitions to using faster storage or provisioning 
more machines to handle the increased load. Figure 7 
illustrates one particular transition between query rate 
regimes. In this case as well, scc’s configurations
yield significant cost savings compared to simple scale-
Figure 7: Transition in scc’s output for ProductSearch from
Config1 at 12 queries/minute to Config2 at 13 queries/minute.


out options—roughly 3x and 2x savings on average in 
comparison to the scaling out of homogeneous configu- 
rations with 7.2K-RPM and 15K-RPM disks, which are 
cost-optimal at low loads. 


We validate scc with an SLA of 12 queries per minute. 
scc’s cluster output for this case has two parts. First, 
the HDFS repository is stored across two machines, each 
with one CPU and two 7.2K-RPM disks. Second, 12 
worker processes are spread across three machines, each 
with one CPU and four SSDs. We run this configuration 
for 15 minutes. Figure 5(b), which plots the fraction of 
required queries completed during the experiment, shows 
that this configuration is able to meet the SLA. 


Next, we compare scc’s output with alternative configurations.
First, we consider clusters with alternative local
storage for the workers: Alt1 and Alt2 use 15K-RPM
drives, and Alt3 uses 7.2K-RPM disks with DRAM.
In each case, we choose the number of workers and
servers to keep cost comparable to scc’s output. In both Alt1 and
Alt2, the disk’s lower random read throughput inflates
query processing times and, hence, aggregate throughput
falls well below the SLA. The performance of Alt3
comes close to the SLA, but still falls short. Second,
when we place all four disks underlying HDFS into one
machine (Alt4), the 1 Gbps network becomes a bottleneck
relative to the aggregate read throughput from four
7.2K-RPM drives. As a result, download times increase,
leading to SLA violations.


We also use this example application to test scc’s abil- 
ity to capture knowledge of working set sizes. We again 
apply scc to satisfy the SLA of 12 queries per minute, but 
this time with the additional input that 20% of product 
types receive 80% of queries (the application specifica- 
tion for this case is shown in Figure 2(c)). In this case, 
scc outputs an alternate architecture where 12 worker
processes, previously run on three machines each with 
four SSDs, are now instead run on three machines each 
with four 15K-RPM disks and 10 GB of DRAM. Queries 
to “hot” products are served from DRAM and those to 
“cold” data are served from the disks. This configuration
meets the SLA with 7% lower cost than the case where
access patterns were assumed to be uniform. 


5.4 Sorting binary tuples 


Our final application, Terasort [29], is a MapReduce job 
that sorts collections of 100-byte tuples, each consist- 
ing of a 10-byte key and a 90-byte value. A Mapper 
reads tuples from a local input file and sends them over 
the network to appropriate Shuffle processes. Each Shuf- 
fler writes the tuples it receives to a set of intermediate, 
sorted local files. Once the Mappers and Shufflers are 
done, the Shuffle processes take on the role of Reducers.
Each Reducer merges the tuples in the local files
into an output file of sorted tuples. For this application, 
the SLA is the total runtime of the MapReduce job. 

We use scc to determine the cost of clusters capable of 
sorting 50 GB for a range of runtimes. Note that though 
we put together clusters of individual servers here, we 
envision that scc will be used for such jobs to provision 
a set of virtual machines in a virtualized infrastructure. 
Unlike PhotoShare and ProductSearch, we see no signif- 
icant architecture changes over different runtimes. scc 
uses the basic building block of provisioning Mappers 
on machines with four cores and one 7.2K-RPM disk and 
Shufflers/Reducers on machines with four cores and two 
7.2K-RPM disks. scc provisions more machines for both 
components to meet more stringent SLAs. Faster storage 
has no benefits because the job is CPU bound. 

Next, we verify the performance of the cluster output 
by scc for an SLA that requires 50 GB to be sorted in 
25 minutes—an average sorting rate of 2 GB per minute. 
The scc cluster consists of 8 Mappers and 16 Reducers 
spread across two and four machines respectively with 
the above-mentioned building blocks. We run the appli- 
cation on this cluster to sort 50 GB of input data. Fig- 
ure 5(c) plots the SLA-specified runtime divided by the 
observed runtime and shows that the scc cluster meets 
the SLA. 

To evaluate the cost-effectiveness of scc’s output, we
also sort 50 GB of data on several alternative architectures.
A few such alternatives include Alt1 and Alt2,
which reduce the number of cores from 4 to 3 on the
Mapper machines and on the Reducer machines, respectively.
Alt3 substitutes the two 7.2K-RPM disks on each
of the four Reducer machines with one 15K-RPM disk
shared between the intermediate and output data. Figure 5(c)
shows that the runtime of the Terasort job misses
the SLA by at least 10% in every case. The figure also
shows that two other alternatives, Alt4 and Alt5, which
have similar cost to scc’s output but trade off compute
resources for more or faster storage, also fall short.

Unlike our other two example applications, compute- 
intensive MapReduce jobs have a cluster configuration 
recommended by best practices. We modify the cluster 





                         Range with same architecture
Attribute                Lowest value   Input value   Highest value
Avg. photo size          50 KB          200 KB        850 KB
Avg. thumbnail size                                   30 KB
SSD unit price           $200           $450          $900
(a)


Dataset                  Most sensitive to what change in hardware costs?
Photos                   20% drop in $ of 7.2K-RPM disk
Thumbnails               92% drop in $ of DRAM
Tags                     31% drop in $ of 15K-RPM disk
(b)


Table 3: Determining robustness of scc’s output with respect 
to its input: (a) robustness of cluster configuration with re- 
spect to input values for a sample set of attributes, and (b) the 
change in hardware costs to which scc’s storage decision for 
each dataset is most sensitive. 


architecture to be six machines each with four cores and 
two 7.2K-RPM disks—a setup recommended by Cloud- 
era for a “Balanced Compute Configuration” [13]. Also, 
we configure every node in the cluster to run a fixed num- 
ber of Mappers and Reducers. We evaluate three differ- 
ent combinations of Mappers and Reducers per node (the 
“2M 2R”, “2M 3R”, and “1M 3R” points in Figure 5(c)), 
and interestingly, we find that the recommended MapRe- 
duce configurations deliver lower performance than scc 
for similarly priced clusters. While all three alternatives 
meet the SLA when scaled out to an additional machine, 
e.g., the “2M 2R+” point in the figure, this results in 
16% higher cost than scc’s recommendation.


6 Discussion 


In this section, we discuss the robustness of scc’s output, 
its utility in planning application implementation archi- 
tectures, and its extensibility on other fronts. 


6.1 Robustness of scc’s output 


scc’s output cluster configuration for a target SLA is a 
function of both the SLA and the exact values specified 
for the various attributes in the application and hardware 
specifications. In practice, a user of scc may not have 
precise values for all attributes due to incomplete knowl- 
edge of the application workload, uncertainty of hard- 
ware costs, or measurement inaccuracy in benchmarking. 

scc is naturally built to cope with such uncertainty.
For every attribute in the input specifications, scc varies 
the value of the attribute in the neighborhood of the ini- 
tially specified value. For each attribute, it then outputs 
the range of values for that attribute wherein the
cost-effective cluster architecture, i.e., the types of resources
assigned to different application components, remains
unchanged; variance of the attribute’s value within this
range can be handled by simply adding more resources 
of the same type. Outside of that range, the cluster will 
need to be revamped with a different type of resource for 
some application component, a significantly more cum- 
bersome undertaking. For example, we again consider 
PhotoShare with an SLA of 100 uploads/s, 300 photo 
views/s, and 100 tag views/s. Table 3(a) shows the value 
ranges output by scc for a few attributes, within which 
the cluster architecture is robust to change. For exam- 
ple, we see that as long as average photo size remains 
between 50 KB and 850 KB, the cluster architecture re- 
mains the same as that obtained with the input value of 
200 KB.

Furthermore, scc can also evaluate the sensitivity of 
its choice of storage configuration for every dataset in 
the application. For example, consider PhotoShare again 
with the same input SLA as above. Based on current 
hardware costs, scc determines that photos be stored on 
15K-RPM disks, thumbnails be stored on SSDs, and tags 
be stored persistently on 7.2K-RPM disks and cached 
in DRAM, in order to meet the SLA at minimum cost. 
However, these recommendations are likely to change as 
prices for storage units drop. scc can determine how robust
its choice of storage options is to such changes in
hardware prices. To do so, it varies the price of every
type of storage unit from its input value down to 0, and 
notes the inflection points at which the optimal storage 
choice for some dataset changes. Based on this analysis, 
it can determine, for every dataset, that change in hard- 
ware price to which the current storage choice for the 
dataset is most sensitive. Table 3(b) shows the output of 
this analysis for the three datasets in PhotoShare. While 
the storage choices for photos and tags are sensitive to 
relatively small reductions in the prices for 7.2K-RPM 
and 15K-RPM disks, scc’s recommendation of storing
thumbnails on SSDs is very robust to price fluctuations. 
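
The price-sensitivity analysis can be sketched as a sweep over one unit price at a time. Everything in the example is hypothetical: the `choose` callback stands in for a full run of scc, and the toy dataset, unit counts, and disk price are made up for illustration.

```python
def price_drop_to_flip(prices, unit, choose, steps=100):
    """Smallest fractional price drop of `unit` (in 1/steps increments) that
    changes the storage choice returned by choose(prices), or None if the
    choice survives even at a price of zero."""
    baseline = choose(prices)
    original = prices[unit]
    for step in range(1, steps + 1):
        trial = dict(prices)
        trial[unit] = original * (1 - step / steps)
        if choose(trial) != baseline:
            return step / steps
    return None


# Toy stand-in for scc: a dataset needs 1 SSD or 4 15K-RPM disks (made-up counts)
# and is assigned whichever option is cheaper.
UNITS_NEEDED = {"ssd": 1, "disk_15k": 4}


def cheapest_type(prices):
    return min(UNITS_NEEDED, key=lambda t: UNITS_NEEDED[t] * prices[t])


prices = {"ssd": 450.0, "disk_15k": 160.0}  # disk price is a placeholder
flip_at = price_drop_to_flip(prices, "disk_15k", cheapest_type)
```

Repeating the sweep for every unit type and keeping the smallest drop per dataset yields the kind of summary shown in Table 3(b).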


6.2 Informing application development 


Thus far, we assumed a fixed application implementa- 
tion. However, scc can also help determine the best ap- 
plication architecture. For instance, in the case of Tera- 
sort, there is a fundamental performance tradeoff be- 
tween a cluster configuration with sufficient DRAM to 
store all data to be sorted and one that must stage por- 
tions of the data into memory from secondary storage. 
The former case requires one read and one write of all the 
data while the latter requires two reads and two writes of 
the data [3]. 

To explore cost-performance tradeoffs for the two application
architectures, we must consider the benefits of
servers with more network bandwidth (so remote stor- 
age does not become a bottleneck) and more memory (to 
allow for storing the entire dataset in memory). In Table 1,
Server2 is the same HP ProLiant DL380 G6 server
as Server1, but with more resources per server and a
10-Gigabit Ethernet (10GigE) NIC. Server3 is the HP ProLiant
DL785 G5 server, which accommodates more processors
and DRAM, again with a 10GigE NIC.

We use scc to determine the cluster configuration nec- 
essary to sort 100 TB in the time required to read/write 
the whole data from/to disks twice at the read/write rate 
of the 7.2K-RPM disk. This cluster costs $239K and 
completes sort in 10,000 seconds. For the alternative im- 
plementation where all data fits in DRAM, we apply scc 
to satisfy the SLA of sorting the complete dataset in half 
the SLA of the baseline implementation. The cheapest 
cluster configuration determined in this case costs $5.6M 
and sorts 100 TB in 5,000 seconds. Thus, according to 
scc, the latter implementation provides a 2x speedup at
24x the cost. The application designer can decide if the 
faster processing is worth it. 


6.3 Extensibility of scc


Our approach of determining cost-effective cluster con- 
figurations with scc is extensible in several ways. 

Less flexible infrastructure services. Though we re- 
strict our attention in this paper to flexible infrastructure 
services that permit arbitrary mixing and matching of 
compute and storage resources on a per-server or per- 
VM basis, scc can also be readily applied to less flexible 
services that offer only certain combinations of proces- 
sor, storage, and memory configurations, e.g., Amazon’s 
EC2 service [1]. In such cases, each combination of re- 
sources offered by the infrastructure service can be pro- 
vided as input to scc as a separate server type, and the 
cost of each server will subsume the costs of all the re- 
sources that come with it. 

Accounting for availability. Though we have fo- 
cused on performance requirements of applications thus 
far, performance and availability SLAs need to be con- 
sidered in unison. For example, a cheap disk type may 
be an attractive option for a capacity-bound dataset but 
the degree of replication necessary to meet availability 
goals may make the option cost-prohibitive. scc can be 
extended to pick for each dataset that combination of 
storage type and associated replication factor that meets 
the combination of performance, availability, and consis- 
tency requirements at minimum cost. 

Load variation and incremental growth. Our cur- 
rent implementation of scc provisions applications for 
peak load. However, when the distribution of load across 
time is available, scc can leverage the information in two 
ways. First, scc can estimate energy costs more accu- 
rately. Second, when pricing for resources is “elastic”, 
i.e., a user can provision resources on demand and pay
for what she uses, scc can make incremental reconfiguration
decisions, determining when to simply scale out
and when to switch between architectures. scc’s distinction
between remote persistent datasets and local transient
datasets enables it to capture the costs associated
with data redistribution.

Network configuration and CPU diversity. scc’s 
specification of application behavior can be used to in- 
fer the communication pattern among the application’s 
components, and thus inform configuration of the clus- 
ter’s network. For example, in the case of ProductSearch, 
scc can infer from the application specification that the 
workers communicate only with the HDFS repository but 
not among themselves. scc can then use this information 
to recommend a bi-partite network with servers hosting 
HDFS on one side and servers hosting workers on the 
other side. scc can also be readily extended to choose 
among a range of CPUs; the application specification 
simply needs to include for every compute operation the 
time required for that operation on each type of CPU. 


7 Related work 


Our work builds upon and shares some similarities with 
several lines of prior work. 

Tuning storage: Minerva [5], Hippodrome [7], and 
Rome [34] automate the provisioning of disk arrays with 
a similar approach of characterizing workloads and stor- 
age. Ursa Minor [2] varies erasure coding parameters 
depending on an application’s availability requirements. 
PADS [9] is configurable to build a wide range of replica- 
tion systems with varying consistency semantics. In con- 
trast to all of these efforts, we consider an application’s 
storage and compute requirements in unison. Moreover, 
we choose among different storage media such as disk, 
SSD, and DRAM to minimize cost, with multiple media 
possibly being used for the same application. 

Application modeling: Bodik et al. [10] infer appli- 
cation performance models by applying machine learn- 
ing techniques on statistics gathered by monitoring the 
application execution. Thereska et al. [31] predict per- 
formance across application configurations based on sta- 
tistical models. IRONModel [32] corrects deviations be- 
tween the performance of running systems and high fi- 
delity models. In all cases, since application models are 
tuned to specific cluster configurations, they are not di- 
rectly applicable to alternative hardware configurations. 

Stewart and Shen [30] build performance models of 
multi-component applications to aid in the placement of 
application components on a given cluster. Osogami 
and Itoko [25] apply hill-climbing techniques to auto- 
matically determine web-server parameters, and Liu et 
al. [20] construct a queuing model for a three-tiered web 
service to predict throughput and response times. Again, 
all of these consider a fixed hardware configuration. 

Application-specific cluster architectures: Applica- 
tion developers have converged on a range of cluster 


architectures for individual applications. Several web 
services employ DRAM caches using distributed in- 
memory storage systems [21, 26]. Applications such 
as WER [16] use clusters that have separate sets of 
machines for compute and storage. FAWN [6] and 
Gordon [11] use SSDs to build performant yet power- 
efficient distributed data processing systems. MR- 
Perf [33] and Starfish [18] use an approach similar to 
scc but focus solely on predicting cluster requirements 
of MapReduce setups. scc not only infers these cost- 
effective architectures for existing applications, but also 
enables the inference of the right cluster architecture for 
emerging applications. 


Storage and computing services. There have been a few 
recent attempts [19, 15] at satisfying SLAs in the set- 
ting of a compute and storage cluster shared across appli- 
cations. Such multi-application environments have also 
seen the recent emergence of virtual storage appliances. 
scc is targeted at the still significantly more common sce- 
nario of cluster deployments for a single application. 


8 Conclusions 


The thesis of our work is that deployment of applications 
on clusters is more cost-effective if informed by charac- 
terizations of application behavior and hardware proper- 
ties. Towards this end, we presented how these inputs 
can be specified, and we developed scc to compile these 
inputs into cost-effective cluster configurations. Our ex- 
periments in applying scc to a range of application work- 
loads and storage options show that scc captures suffi- 
cient detail to prescribe the right combination of storage 
and server hardware at the right scale; modifying the ar- 
chitecture or reducing the scale leads to significant per- 
formance degradation. To meet application demands, scc 
often predicts heterogeneous cluster architectures that re- 
sult in significant cost savings compared to simply scal- 
ing out homogeneous architectures. We plan to apply 
scc to other popular applications to determine more fine- 
grained characteristics from which it could benefit, and 
use scc’s application specification to select appropriate 
CPUs and optimize network costs. We also plan to de- 
velop tools to make it easier to put together hardware and 
application specifications. 
Abstract 


Deduplication technologies are increasingly being de- 
ployed to reduce cost and increase space-efficiency in 
corporate data centers. However, prior research has not 
applied deduplication techniques inline to the request 
path for latency sensitive, primary workloads. This is 
primarily due to the extra latency these techniques intro- 
duce. Inherently, deduplicating data on disk causes frag- 
mentation that increases seeks for subsequent sequential 
reads of the same data, thus, increasing latency. In addi- 
tion, deduplicating data requires extra disk IOs to access 
on-disk deduplication metadata. In this paper, we pro- 
pose an inline deduplication solution, iDedup, for pri- 
mary workloads, while minimizing extra IOs and seeks. 

Our algorithm is based on two key insights from real- 
world workloads: i) spatial locality exists in duplicated 
primary data; and ii) temporal locality exists in the access 
patterns of duplicated data. Using the first insight, we se- 
lectively deduplicate only sequences of disk blocks. This 
reduces fragmentation and amortizes the seeks caused by 
deduplication. The second insight allows us to replace 
the expensive, on-disk, deduplication metadata with a 
smaller, in-memory cache. These techniques enable us 
to trade off capacity savings for performance, as demon- 
strated in our evaluation with real-world workloads. Our 
evaluation shows that iDedup achieves 60-70% of the 
maximum deduplication with less than a 5% CPU over- 
head and a 2-4% latency impact. 


1 Introduction 


Storage continues to grow at an explosive rate of over 
52% per year [10]. In 2011, the amount of data will sur- 
pass 1.8 zettabytes [17]. According to the IDC [10], to 
reduce costs and increase storage efficiency, more than 
80% of corporations are exploring deduplication tech- 
nologies. However, there is a huge gap in the current ca- 
pabilities of deduplication technology. No deduplication 


systems exist that deduplicate inline with client requests 
for latency sensitive primary workloads. All prior dedu- 
plication work focuses on either: i) throughput sensitive 
archival and backup systems [8, 9, 15, 21, 26, 39, 41]; 
or ii) latency sensitive primary systems that deduplicate 
data offline during idle time [1, 11, 16]; or iii) file sys- 
tems with inline deduplication, but agnostic to perfor- 
mance [3, 36]. This paper introduces two novel insights 
that enable latency-aware, inline, primary deduplication. 

Many primary storage workloads (e.g., email, user di- 
rectories, databases) are currently unable to leverage the 
benefits of deduplication, due to the associated latency 
costs. Since offline deduplication systems impact la- 
tency the least, they are currently the best option; how- 
ever, they are inefficient. For example, offline systems 
require additional storage capacity to absorb the writes 
prior to deduplication, and excess disk bandwidth to per- 
form reads and writes during deduplication. This ad- 
ditional disk bandwidth can impact foreground work- 
loads. Additionally, inline compression techniques also 
exist [5, 6, 22, 38] that are complementary to our work. 

The challenge of inline deduplication is to not increase 
the latency of the already latency sensitive, foreground 
operations. Reads are affected by the fragmentation 
in data layout that naturally occurs when deduplicating 
blocks across many disks. As a result, subsequent se- 
quential reads of deduplicated data are transformed into 
random IOs resulting in significant seek penalties. Most 
of the deduplication work occurs in the write path; i.e., 
generating block hashes and finding duplicate blocks. To 
identify duplicates, on-disk data structures are accessed. 
This leads to extra IOs and increased latency in the write 
path. To address these performance concerns, it is nec- 
essary to minimize any latencies introduced in both the 
read and write paths. 

We started with the realization that in order to improve 
latency a tradeoff must be made elsewhere. Thus, we 
were motivated by the question: Is there a tradeoff be- 
tween performance and the degree of achievable dedu- 
plication? While examining real-world traces [20], we 
developed two key insights that ultimately led to an an- 
swer: i) spatial locality exists in the duplicated data; and 
ii) temporal locality exists in the accesses of duplicated 
data. The first observation allows us to amortize the 
seeks caused by deduplication by only performing dedu- 
plication when a sequence of on-disk blocks is dupli- 
cated. The second observation enables us to maintain an 
in-memory fingerprint cache to detect duplicates in lieu 
of any on-disk structures. The first observation mitigates 
fragmentation and addresses the extra read path latency; 
whereas the second one removes extra IOs and lowers 
write path latency. These observations lead to two con- 
trol parameters: i) the minimum number of sequential 
duplicate blocks on which to perform deduplication; and 
ii) the size of the in-memory fingerprint cache. By ad- 
justing these parameters, a tradeoff is made between the 
capacity savings of deduplication and the performance 
impact to the foreground workload. 

This paper describes the design, implementation and 
evaluation of our deduplication system (iDedup) built to 
exploit the spatial and temporal localities of duplicate 
data in primary workloads. Our evaluation shows that 
good capacity savings are achievable (between 60%-70% 
of maximum) with a small impact to latency (2-4% on 
average). In summary, our key contributions include: 


• Insights on spatial and temporal locality of dupli- 
cated data in real-world, primary workloads. 

• Design of an inline deduplication algorithm that 
leverages both spatial and temporal locality. 

• Implementation of our deduplication algorithm in an 
enterprise-class, network attached storage system. 

• Implementation of efficient data structures to reduce 
resource overheads and improve cacheability. 

• Demonstration of a viable tradeoff between perfor- 
mance and capacity savings via deduplication. 

• Evaluation of our algorithm using data from real- 
world, production, enterprise file system traces. 


The remainder of the paper is organized as follows: Section 2 
provides background and motivation of the work; Sec- 
tion 3 describes the design of our deduplication system; 
Section 4 describes the system’s implementation; Sec- 
tion 5 evaluates the implementation; Section 6 describes 
related work, and Section 7 concludes. 


2 Background and motivation 


Thus far, the majority of deduplication research has tar- 
geted improving deduplication within the backup and 
archival (or secondary storage) realm. As shown in Ta- 
ble 1, very few systems provide deduplication for latency 
sensitive primary workloads. We believe that this is due 
to the significant challenges in performing deduplication 
Type                  | Offline                      | Inline 
----------------------+------------------------------+------------------------ 
Primary,              | NetApp ASIS [1],             | iDedup 
latency sensitive     | EMC Celerra [11],            | (This paper) 
                      | StorageTank [16]             | 
----------------------+------------------------------+------------------------ 
Secondary,            | (No motivation for systems   | EMC DDFS [41], 
throughput sensitive  | in this category)            | EMC Cluster [8], 
                      |                              | DeepStore [40], 
                      |                              | NEC HydraStor [9], 
                      |                              | Venti [31], SiLo [39], 
                      |                              | Sparse Indexing [21], 
                      |                              | ChunkStash [7], 
                      |                              | Foundation [32], 
                      |                              | Symantec [15], 
                      |                              | EMC Centera [24], 
                      |                              | GreenBytes [13] 

Table 1: Table of related work. The table shows how this pa- 
per, iDedup, is positioned relative to some other relevant work. 
Some primary, inline deduplication file systems (like ZFS [3]) 
are omitted, since they are not optimized for latency. 

without affecting latency, rather than the lack of benefit 
deduplication provides for primary workloads. Our sys- 
tem is specifically targeted at this gap. 

The remainder of this section further describes the dif- 
ferences between primary and secondary deduplication 
systems and describes the unique challenges faced by 
primary deduplication systems. 


2.1 Classifying deduplication systems 


Although many classifications for deduplication systems 
exist, they are usually based on internal implementation 
details, such as the fingerprinting (hashing) scheme or 
whether fixed sized or variable sized blocks are used. Al- 
though important, these schemes are usually orthogonal 
to the types of workloads their system supports. Similar 
to other storage systems, deduplication systems can be 
broadly classified as primary or secondary depending on 
the workloads they serve. Primary systems are used for 
primary workloads. These workloads tend to be latency 
sensitive and use RPC based protocols, such as NFS [30], 
CIFS [37] or iSCSI [35]. On the other hand, secondary 
systems are used for archival or backup purposes. These 
workloads process large amounts of data, are throughput 
sensitive and are based on streaming protocols. 

Primary and secondary deduplication systems can be 
further subdivided into inline and offline deduplication 
systems. Inline systems deduplicate requests in the write 
path before the data is written to disk. Since inline dedu- 
plication introduces work into the critical write path, it 
often leads to an increase in request latency. On the other 
hand, offline systems tend to wait for system idle time to 
deduplicate previously written data. Since no operations 
are introduced within the write path, write latency is not 
affected, but reads remain fragmented. 
Content addressable storage (CAS) systems (e.g., [24, 
31]) naturally perform inline deduplication, since blocks 
are typically addressed by their fingerprints. Both 
archival and CAS systems are sometimes used for pri- 
mary storage. Likewise, a few file systems that perform 
inline deduplication (e.g., ZFS [3] and SDFS [36]) are 
also used for primary storage. However, none of these 
systems are specifically optimized for latency sensitive 
workloads while performing inline deduplication. Their 
design for maximum deduplication introduces extra IOs 
and does not address fragmentation. 

Primary inline deduplication systems have the follow- 
ing advantages over offline systems: 


1. Storage provisioning is easier and more efficient: Of- 
fline systems require additional space to absorb the 
writes prior to deduplication processing. This causes 
a temporary bloat in storage usage leading to inaccu- 
rate space accounting and provisioning. 

2. No dependence on system idle time: Offline sys- 
tems use idle time to perform deduplication without 
impacting foreground requests. This is problematic 
when the system is busy for long periods of time. 

3. Disk-bandwidth utilization is lower: Offline systems 
use extra disk bandwidth when reading in the staged 
data to perform deduplication and then again to write 
out the results. This limits the total bandwidth avail- 
able to the system. 


For good reason, the majority of prior deduplication 
work has focused on the design of inline, secondary 
deduplication systems. Backup and archival workloads 
typically have a large amount of duplicate data, thus 
the benefit of deduplication is large. For example, re- 
ports of 90+% deduplication ratios are not uncommon 
for backup workloads [41], compared to the 20-30% we 
observe from our traces of primary workloads. Also, 
since backup workloads are not latency sensitive, they 
are tolerant to delays introduced in the request path. 


2.2 Challenges of primary deduplication 


The almost exclusive focus on maximum deduplication 
at the expense of performance has left a gap for la- 
tency sensitive workloads. Since primary storage is usu- 
ally the most expensive, any savings obtained in primary 
systems have high cost advantages. Due to their higher 
cost ($/GB), deduplication is even more critical for flash 
based systems; nothing precludes our techniques from 
working with these systems. In order for primary, in- 
line deduplication to be practical for enterprise systems, 
a number of challenges must be overcome: 


• Write path: The metadata management and IO re- 
quired to perform deduplication inline with the write 
request increases write latency. 

• Read path: The fragmentation of otherwise sequen- 
tial writes increases the number of disk seeks re- 
quired during reads. This increases read latency. 

• Delete path: The requirement to check whether a 
block can be safely deleted increases delete latency. 


Figure 1: a) Increase in seeks due to increased fragmenta- 
tion. b) The amortization of seeks using sequences. This figure 
shows the amortization of seeks between disk tracks by using 
sequences of blocks (threshold=3). 


All of these penalties, due to deduplication, impact the 
performance of foreground workloads. Thus, primary 
deduplication systems only employ offline techniques to 
avoid interfering with foreground requests [1, 11, 16]. 


Write path: For inline, primary deduplication, write 
requests deduplicate data blocks prior to writing those 
blocks to stable storage. At a minimum, this involves 
fingerprinting the data block and comparing its signature 
within a table of previously written blocks. If a match 
is found, the metadata for the block, e.g., the file’s block 
pointer, is updated to point to the existing block and no 
write to stable storage is required. Additionally, a ref- 
erence count on the existing block is incremented. If a 
match is not found, the block is written to stable storage 
and the table of existing blocks is updated with the new 
block’s signature and its storage location. The additional 
work performed during write path deduplication can be 
summarized as follows: 


• Fingerprinting data consumes extra CPU resources. 

• Performing fingerprint table lookups and managing 
the table persistently on disk requires extra IOs. 

• Updating a block’s reference count requires an up- 
date to persistent storage. 


As one can see, the management of deduplication meta- 
data, in memory, and on persistent storage, accounts for 
the majority of write path overheads. Even though much 
previous work has explored optimizing metadata man- 
agement for inline, secondary systems (e.g., [2, 15, 21, 
39, 41]), we feel that it is necessary to minimize all extra 
IO in the critical path for latency sensitive workloads. 
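
The basic write-path steps described above (fingerprint, table lookup, reference-count update or new write) can be sketched in Python. This is a minimal illustration under stated assumptions, not the system described in this paper: the class name DedupStore, the in-memory dictionaries standing in for the on-disk table and reference counts, and the choice of SHA-256 as the fingerprint are all hypothetical.

```python
import hashlib


class DedupStore:
    """Toy block store with inline deduplication (illustrative only)."""

    def __init__(self):
        self.blocks = {}        # storage location -> block data (stands in for disk)
        self.fingerprints = {}  # fingerprint -> storage location
        self.refcounts = {}     # storage location -> number of references
        self.next_location = 0

    def write_block(self, data: bytes) -> int:
        """Return the location that the file's block pointer should use."""
        fp = hashlib.sha256(data).digest()    # fingerprinting costs CPU
        location = self.fingerprints.get(fp)  # table lookup (extra IO if on disk)
        if location is not None:
            self.refcounts[location] += 1     # duplicate: no data write needed
            return location
        location = self.next_location         # new data: allocate and write
        self.next_location += 1
        self.blocks[location] = data
        self.fingerprints[fp] = location
        self.refcounts[location] = 1
        return location


store = DedupStore()
a = store.write_block(b"x" * 4096)
b = store.write_block(b"y" * 4096)
c = store.write_block(b"x" * 4096)  # duplicate of the first block
```

In this sketch the third write returns the location of the first block and only bumps its reference count; each of the three dictionary operations corresponds to one of the overheads listed above.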
Read path: Deduplication naturally fragments data that 
would otherwise be written sequentially. Fragmentation 
occurs because a newly written block may be dedupli- 
cated to an existing block that resides elsewhere on stor- 
age. Indeed, the higher the deduplication ratio, the higher 
the likelihood of fragmentation. Figure 1(a) shows the 
potential impact of fragmentation on reads in terms of the 
increased number of seeks. When using disk based storage, 
the extra seeks can cause a substantial increase in read 
latency. Deduplication can convert sequential reads from 
the application into random reads from storage. 


Delete path: Typically, some metadata records the us- 
age of shared blocks. For example, a table of reference 
counts can be maintained. This metadata must be queried 
and updated inline to the deletion request. These actions 
can increase the latency of delete operations. 


3 Design 


In this section, we present the rationale that led to our so- 
lution, the design of our architecture, and the key design 
challenges of our deduplication system (iDedup). 


3.1 Rationale for solution 


To better understand the challenges of inline deduplica- 
tion, we performed data analysis on real-world enterprise 
workloads [20]. First, we ran simulations varying the 
block size to see its effect on deduplication. We observed 
that the drop in deduplication ratio was less than linear 
with increasing block size. This implies duplicated data 
is clustered, thus indicating spatial locality in the data. 
Second, we ran simulations varying the fingerprint table 
size to determine if the same data is written repeatedly 
close in time. Again, we observed the drop in deduplica- 
tion ratio was less than linear with decreasing table size. 
This implies duplicated data exhibits notable temporal 
locality, thus making the fingerprint table amenable to 
caching. Unfortunately, we could not test our hypothe- 
sis on other workloads due to the lack of traces with data 
duplication patterns. 


3.2 Solution overview 


We use the observations of spatial and temporal locality 
to derive an inline deduplication solution. 


Spatial locality: We leverage the spatial locality to per- 
form selective deduplication, thereby mitigating the extra 
seeks introduced by deduplication for sequentially read 
files. To accomplish this, we examine blocks at write 
time and deduplicate a sequence of file blocks if and only 
if the blocks i) are sequential in the file and ii) have duplicates 
that are sequential on disk. Even with this optimization, sequential 
reads can still incur seeks between sequences. However, 
if we enforce an appropriate minimum sequence length 
for such sequences (the threshold), the extra seek cost 
is expected to be amortized, as shown by Figure 1(b). 
The threshold is a configurable parameter in our system. 
While some schemes employ a larger block size to lever- 
age spatial locality, they are limited as the block size rep- 
resents both the minimum and the maximum sequence 
length. In contrast, our threshold represents the minimum 
sequence length and the maximum sequence length is 
only limited by the file’s size. 


Inherently, due to our selective approach, only a sub- 
set of blocks are deduplicated, leading to lower capacity 
savings. Therefore, our inline deduplication technique 
exposes a tradeoff between capacity savings and perfor- 
mance, which we observe via experiments to be reason- 
able for certain latency sensitive workloads. For an op- 
timal tradeoff, the threshold must be derived empirically 
to match the randomness in the workload. Additionally, 
to recover the lost savings, our system does not preclude 
executing other offline techniques. 


Temporal locality: In all deduplication systems, there 
is a structure that maps the fingerprint of a block to its 
location on disk. We call this the deduplication meta- 
data structure (or dedup-metadata for short). Its size is 
proportional to the number of blocks and it is typically 
stored on disk. Other systems use this structure as a 
lookup table to detect duplicates in the write path; this 
leads to extra, expensive, latency-inducing, random IOs. 


We leverage the temporal locality by maintaining 
dedup-metadata as a completely memory-resident, LRU 
cache, thereby avoiding extra dedup-metadata IOs. 
There are a few downsides to using a smaller, in-memory 
cache. Since we only cache mappings for a subset of 
blocks, we might not deduplicate certain blocks due to 
lack of information. In addition, the memory used by the 
cache reduces the file system’s buffer cache size. This 
can lead to a lower buffer cache hit rate, affecting la- 
tency. On the other hand, the buffer cache becomes more 
effective by caching deduplicated blocks [19]. These ob- 
servations expose another tradeoff between performance 
(hit rate) and capacity savings (dedup-metadata size). 


3.3 Architecture 


In this subsection, we provide an overview of our archi- 
tecture. In addition, we describe the changes to the IO 
path to perform inline deduplication. 
Figure 2: iDedup Architecture. Non-deduplicated blocks (dif- 
ferent patterns) in NVRAM buffer are deduplicated by the iD- 
edup algorithm before writing them to disk via the file system. 


3.3.1 Storage system overview 


An enterprise-class network-attached storage (NAS) sys- 
tem (as illustrated in Figure 2) is used as the reference 
system to build iDedup. For primary workloads, the sys- 
tem supports the NFS [30] and CIFS [37] RPC-based 
protocols. As seen in Figure 2, the system uses a log- 
structured file system [34] combined with non-volatile 
RAM (NVRAM) to buffer client writes to reduce re- 
sponse latency. These writes are periodically flushed to 
disk during the destage phase. Allocation of new disk 
blocks occurs during this phase and is performed succes- 
sively for each file written. Individual disk blocks are 
identified by their unique disk block numbers (DBNs). 
File metadata, containing the DBNs of its blocks, is 
stored within an inode structure. Given our objective to 
perform inline deduplication, the newly written (dirty) 
blocks need to be deduplicated during the destage phase. 
By performing deduplication during destage, the system 
benefits by not deduplicating short-lived data that is over- 
written or deleted while buffered in NVRAM. Adding in- 
line deduplication modifies the write path significantly. 


3.3.2 Write path flow 


Compared to the normal file system write path, we add 
an extra layer of deduplication processing. As this layer 
consumes extra CPU cycles, it can prolong the total time 
required to allocate dirty blocks and affect time-sensitive 
file system operations. Moreover, any extra IOs in this 
layer can interfere with foreground read requests. Thus, 
this layer must be optimized to minimize overheads. On 
the other hand, there is an opportunity to overlap dedu- 
plication processing with disk write IOs in the destage 


phase. The following steps take place in the write path: 

1. For each file, the list of dirty blocks is obtained. 

2. For each dirty block, we compute its fingerprint (hash 
of the block’s content) and perform a lookup in the 
dedup-metadata structure using the hash as the key. 

3. If a duplicate is found, we examine adjacent blocks, 
using the iDedup algorithm (Section 3.4), to deter- 
mine if it is part of a duplicate sequence. 

4. While examining subsequent blocks, some duplicate 
sequences might end. In those cases, the length of the 
sequence is determined; if it is greater than the con- 
figured threshold, we mark the sequence for dedupli- 
cation. Otherwise, we allocate new disk blocks and 
add the fingerprint metadata for these blocks. 

5. When a duplicate sequence is found, the DBN of 
each block in the sequence is obtained and the file’s 
metadata is updated and eventually written to disk. 

6. Finally, to maintain file system integrity in the face of 
deletes, we update reference counts of the duplicated 
blocks in a separate structure on disk. 
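
Steps 2 to 4 above can be sketched as a single scan over a file's dirty blocks. This is a simplified illustration, not the algorithm of Section 3.4: the function name find_dedup_runs is hypothetical, each fingerprint is assumed to map to at most one DBN, and overlapping candidate sequences are not modeled. The comparison follows the wording of step 4 (length strictly greater than the threshold); if the threshold is instead read as the minimum sequence length, the comparison would be >=.

```python
def find_dedup_runs(dirty_fps, fp_to_dbn, threshold):
    """Return (start_index, length, first_dbn) for each run to deduplicate.

    A run is a maximal stretch of consecutive file blocks whose duplicates
    sit at consecutive DBNs. Runs that are not longer than `threshold` are
    skipped, so those blocks would be written to newly allocated disk blocks.
    """
    runs = []
    start = None     # index where the current run began
    prev_dbn = None  # DBN matched by the previous block in the run

    def close_run(end):
        # Record the run [start, end) if it is long enough.
        if start is not None and end - start > threshold:
            runs.append((start, end - start, fp_to_dbn[dirty_fps[start]]))

    for i, fp in enumerate(dirty_fps):
        dbn = fp_to_dbn.get(fp)
        if dbn is not None and start is not None and dbn == prev_dbn + 1:
            prev_dbn = dbn  # run continues
            continue
        close_run(i)        # run (if any) ends before block i
        start, prev_dbn = (i, dbn) if dbn is not None else (None, None)
    close_run(len(dirty_fps))
    return runs


# Hypothetical fingerprints and DBNs, chosen only for the example.
table = {"a": 100, "b": 101, "c": 102, "d": 200, "e": 500, "f": 501}
dirty = ["a", "b", "c", "x", "d", "e", "f"]
runs = find_dedup_runs(dirty, table, threshold=2)
```

With these made-up inputs, only the three-block run starting at file block 0 exceeds a threshold of 2; the isolated duplicate "d" and the two-block run "e", "f" would be written as new blocks.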


3.3.3 Read path flow 


Since iDedup updates the file’s metadata as soon as dedu- 
plication occurs, the file system cannot distinguish be- 
tween a duplicated block and a non-duplicated one. This 
allows file reads to occur in the same manner for all files, 
regardless of whether they contain deduplicated blocks. 
Although sequential reads may incur extra seeks due to 
deduplication, having a minimum sequence length helps 
amortize this cost. Moreover, if we pick the threshold 
closer to the expected sequentiality of a workload, then 
the effects of those seeks can be hidden. 


3.3.4 Delete path flow 


As mentioned in the write path flow, deduplicated blocks 
need to be reference counted. During deletion, the ref- 
erence count of deleted blocks is decremented and only 
blocks with no references are freed. In addition to updat- 
ing the reference counts, we also update the in-memory 
dedup-metadata when a block is deleted. 
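
A minimal sketch of this delete path follows. The function name delete_block and the reverse map dbn_to_fp are assumptions made for the illustration, as is the choice to drop the cached fingerprint only when the block is actually freed; the text above does not specify how the in-memory dedup-metadata is updated.

```python
def delete_block(dbn, refcounts, fp_to_dbn, dbn_to_fp):
    """Drop one reference to `dbn`; return True if the block was freed."""
    refcounts[dbn] -= 1
    if refcounts[dbn] > 0:
        return False               # still shared by another file block
    del refcounts[dbn]             # last reference gone: free the block
    fp = dbn_to_fp.pop(dbn, None)  # forget its cached fingerprint, if any
    if fp is not None:
        fp_to_dbn.pop(fp, None)
    return True


refcounts = {7: 2}
fp_to_dbn = {b"fp": 7}
dbn_to_fp = {7: b"fp"}
first = delete_block(7, refcounts, fp_to_dbn, dbn_to_fp)
second = delete_block(7, refcounts, fp_to_dbn, dbn_to_fp)
```

Removing the cache entry when the block is freed keeps later writes from being deduplicated against a DBN that no longer holds the data.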


3.4 iDedup algorithm 


The iDedup deduplication algorithm has the following 
key design objectives: 


1. The algorithm should be able to identify sequences of 
file blocks that are duplicates and whose correspond- 
ing DBNs are sequential. 

2. The largest duplicate sequence for a given set of file 
blocks should be identified. 

3. The algorithm should minimize searches in the 
dedup-metadata to reduce CPU overheads. 


FAST '12: 10th USENIX Conference on File and Storage Technologies 




4. The algorithm execution should overlap with disk IO 
during the destage phase and not prolong the phase. 

5. The memory and CPU overheads caused by the algo- 
rithm should not prevent other file system processes 
from accomplishing their tasks in a timely manner. 

6. The dedup-metadata must be optimized for lookups. 


More details of the algorithm are presented in Sec- 
tion 4.2. Next, we describe the design elements that en- 
able these objectives. 


3.4.1 Dedup-metadata cache design 


The dedup-metadata is maintained as a cache with one 
entry per block. Each entry maps the fingerprint of a 
block to its DBN on disk. We use LRU as the cache 
replacement policy; other replacement policies did not 
perform better than the simpler LRU scheme. 
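
A fixed-capacity LRU map of this kind can be sketched with Python's OrderedDict. The class name FingerprintCache and its methods are hypothetical and are not taken from the implementation described in Section 4; the sketch only illustrates the replacement policy.

```python
from collections import OrderedDict


class FingerprintCache:
    """Fixed-capacity LRU map from fingerprint to DBN (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, fingerprint: bytes):
        dbn = self.entries.get(fingerprint)
        if dbn is not None:
            self.entries.move_to_end(fingerprint)  # mark as most recently used
        return dbn

    def insert(self, fingerprint: bytes, dbn: int) -> None:
        self.entries[fingerprint] = dbn
        self.entries.move_to_end(fingerprint)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # evict least recently used

    def remove(self, fingerprint: bytes) -> None:
        self.entries.pop(fingerprint, None)


cache = FingerprintCache(capacity=2)
cache.insert(b"fp-a", 10)
cache.insert(b"fp-b", 11)
cache.lookup(b"fp-a")      # touch fp-a so fp-b becomes the eviction candidate
cache.insert(b"fp-c", 12)  # exceeds capacity; fp-b is evicted
```

A miss in such a cache simply means the block is written without deduplication, which is the capacity-for-performance tradeoff discussed in Section 3.2.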

The choice of the fingerprint influences the size of the 
entry and the number of CPU cycles required to compute 
it. By leveraging processor hardware assists (e.g., 
Intel AES [14]) to compute stronger fingerprints (like 
SHA-2, SHA-256 [28], etc.), the CPU overhead can be 
greatly mitigated. However, longer, 256-bit fingerprints 
increase the size of each entry. In addition, a DBN of 
32-bits to 64-bits must also be kept within the entry, thus 
making the minimum entry size 36 bytes. Given a block 
size of 4 KB (typical of many file systems), the cache 
entries comprise an overhead of 0.8% of the total size. 
Since we keep the cache in memory, this overhead is sig- 
nificant as it reduces the number of cached blocks. 
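
The entry-size arithmetic above can be checked directly. The sketch assumes the smallest case mentioned in the text (a 256-bit fingerprint plus a 32-bit DBN, with no other per-entry fields); the helper cache_bytes_for and the 1 TiB figure are illustrative only.

```python
FINGERPRINT_BITS = 256        # e.g., a SHA-256 fingerprint
DBN_BITS = 32                 # smallest DBN width mentioned in the text
BLOCK_SIZE_BYTES = 4 * 1024   # 4 KB file system block

entry_bytes = (FINGERPRINT_BITS + DBN_BITS) // 8
overhead = entry_bytes / BLOCK_SIZE_BYTES


def cache_bytes_for(data_bytes: int) -> int:
    """Memory needed to hold one entry per block of `data_bytes` of data."""
    return (data_bytes // BLOCK_SIZE_BYTES) * entry_bytes


one_tib = 2 ** 40
full_cache = cache_bytes_for(one_tib)
```

Under these assumptions each entry is 36 bytes, or just under 0.9% of a 4 KB block, in line with the figure quoted above. Holding an entry for every block of 1 TiB of data would take 9 GiB of memory, which is why the cache covers only a subset of blocks.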

In many storage systems, memory not reserved for 
data structures is used by the buffer cache. Hence, the 
memory used by the dedup-metadata cache comes at the 
expense of a larger buffer cache. Therefore, the effect 
of the dedup-metadata cache on the buffer cache hit ratio 
needs to be evaluated empirically to size the cache. 


3.4.2 Duplicate sequence processing 


This subsection describes some common design issues in 
duplicate sequence identification. 


Sequence identification: The goal is to identify the 
largest sequence among the list of potential sequences. 
This can be done in multiple ways: 


• Breadth-first: Start by scanning blocks in order; con- 
currently track all possible sequences; and decide on 
the largest when a sequence terminates. 

• Depth-first: Start with a sequence and pursue it 
across the blocks until it terminates; make multiple 
passes until all sequences are probed; and then pick 
the largest. Information gathered during one pass can 
be utilized to make subsequent passes more efficient. 
Figure 3: Overlapped sequences. This figure shows an exam- 
ple of how the algorithm works with overlapped sequences. 


In practice, we observed long chains of blocks during 
processing (on the order of 1000s). Since multiple passes are too 
expensive, we use the breadth-first approach. 


Overlapped sequences: Choosing between a set of over-
lapped sequences can prove problematic. An example of
how overlapping sequences are handled is illustrated in
Figure 3. Assume a threshold of 4. Scanning from left
to right, multiple sequences match the set of file blocks.
As we process the 7th block, one of the sequences (S1)
terminates with a length of 6. However, sequences S2 and
S3 have not yet terminated and have blocks overlapping
with S1. Since S1 is longer than the threshold (4), we can
deduplicate the file blocks matching those in S1. How-
ever, by accepting S1, we are rejecting the overlapped
blocks from S2 or S3; this is the dilemma. It is possi-
ble that either S2 or S3 could lead to a longer
sequence going forward, but it is necessary to make a de-
cision about S1. Since it is not possible to know the best
outcome, we use the following heuristic: we determine
whether the number of non-overlapped blocks in the
terminated sequence is greater than the threshold; if so,
we deduplicate them. Otherwise, we defer to
the unterminated sequences, as they may grow longer.
Thus, in the example, we reject S1 for this reason.
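
The overlap heuristic can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `blocks_to_dedup`, the representation of a sequence as a collection of file block numbers, and the block ranges in the example are all hypothetical. The sketch also accepts a run of exactly `threshold` blocks, which is one possible reading of "greater than threshold".

```python
def blocks_to_dedup(terminated, live_sequences, threshold):
    """Return the file blocks of a terminated sequence that may be deduplicated.

    terminated      -- file block numbers (FBNs) covered by the terminated sequence
    live_sequences  -- FBN collections of sequences that are still growing
    threshold       -- minimum sequence length (assumed inclusive in this sketch)
    """
    overlapped = set()
    for seq in live_sequences:
        overlapped.update(seq)
    free = [fbn for fbn in terminated if fbn not in overlapped]
    # Deduplicate only if enough non-overlapped blocks remain;
    # otherwise defer to the unterminated sequences.
    return free if len(free) >= threshold else []


# Hypothetical layout loosely modeled on the Figure 3 discussion:
# S1 covers FBNs 1-6 and has terminated; S2 and S3 are still growing.
s1 = list(range(1, 7))
s2 = list(range(3, 8))
s3 = list(range(5, 8))
print(blocks_to_dedup(s1, [s2, s3], threshold=4))  # prints [] (S1 is rejected)
```

With these assumed ranges only two blocks of S1 are free of overlap, so the function defers to S2 and S3, mirroring the outcome described for the example.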


3.4.3 Threshold determination


The minimum sequence threshold is a workload property 
that can only be derived empirically. The ideal threshold 
is one that most closely matches the workload’s natural 
sequentiality. For workloads with more random IO, it is 
possible to set a lower threshold because deduplication 
should not worsen the fragmentation. It is possible to 
have a real-time, adaptive scheme that sets the thresh- 
old based on the randomness of the workload. Although 
valuable, this investigation is beyond this paper’s scope. 


4 Implementation 


In this section, we present the implementation and op-
timizations of our inline deduplication system. The im-
plementation consists of two subsystems: i) the dedup-
metadata management; and ii) the iDedup algorithm.


4.1 Dedup-metadata management 


The dedup-metadata management subsystem is com- 
prised of several components: 


1. Dedup-metadata cache (in RAM): Contains a pool of 
block entries (content-nodes) that contain deduplica- 
tion metadata organized as a cache. 

2. Fingerprint hash table (in RAM): This table maps a 
fingerprint to DBN(s). 

3. DBN hash table (in RAM): This table maps a DBN 
to its content-node; used to delete a block. 

4. Reference count file (on disk): Maintains reference 
counts of deduplicated file system blocks in a file. 


We explore each of them next. 


4.1.1 Dedup-metadata cache 


This is a fixed-size pool of small entries called content- 
nodes, managed as an LRU cache. The size of this pool 
is configurable at compile time. Each content-node rep- 
resents a single disk block and is about 64 bytes in size. 
The content-node contains the block’s DBN (a 4 B in- 
teger) and its fingerprint. In our prototype, we use the 
MD5 checksum (128-bit) [33] of the block’s contents as 
its fingerprint. Using a stronger fingerprint (like SHA- 
256) would increase the memory overhead of each entry 
by 25%, thus leading to fewer blocks cached. Other than 
this effect, using MD5 is not expected to alter other ex- 
perimental results. 

All the content-nodes are allocated as a single global 
array. This allows the nodes to be referenced by their 
array index (a 4 byte value) instead of by a pointer. This 
saves 4 bytes per pointer in 64-bit systems. Each content- 
node is indexed by three data structures: the fingerprint 
hash table, the DBN hash table and the LRU list. This 
adds two pointers per index (to doubly link the nodes in 
a list or tree), thus totaling six pointers per content-node. 
Therefore, by using array indices instead of pointers we 
save 24 bytes per entry (37.5%). 
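
The savings quoted above follow from simple arithmetic, reproduced here as a short Python check. The constant names are ours, and reading the 37.5% figure as 24 bytes relative to a 64-byte content-node is our interpretation of the text.

```python
POINTER_BYTES = 8        # native pointer size on a 64-bit system
INDEX_BYTES = 4          # array index into the global content-node array
INDEX_STRUCTURES = 3     # fingerprint hash table, DBN hash table, LRU list
LINKS_PER_STRUCTURE = 2  # each structure doubly links its nodes
NODE_BYTES = 64          # approximate content-node size quoted in the text

links = INDEX_STRUCTURES * LINKS_PER_STRUCTURE
saved = links * (POINTER_BYTES - INDEX_BYTES)
print(links, saved, saved / NODE_BYTES)  # 6 24 0.375
```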


4.1.2 Fingerprint hash table 


This hash table contains content-nodes indexed by their 
fingerprint. It enables a block’s duplicates to be identi- 
fied by using the block’s fingerprint. As shown in Fig- 
ure 4, each hash bucket contains a single pointer to the 
root of a red-black tree containing the collision list for 
that bucket. This is in contrast to a traditional hash ta- 
ble with a doubly linked list for collisions at the cost of 
two pointers per bucket. The red-black tree implemen- 
tation is an optimized, left-leaning, red-black tree [12]. 
Figure 4: Fingerprint Hash Table. The fingerprint hash ta-
ble with hash buckets as pointers to collision trees. Content- 
node with fingerprint ‘foo’ has duplicate content-nodes in a tree 
(dup-tree) with DBNs 205 and 110. 


With uniform distribution, each hash bucket is designed 
to hold 16 entries, ensuring an upper-bound of 4 searches 
within the collision tree (tree search cost is O(log N)). By
reducing the size of the pointers and the number of point- 
ers per bucket, the per-bucket overhead is reduced, thus 
providing more buckets for the same memory size. 

Each collision tree content-node represents a unique 
fingerprint value in the system. For thresholds greater 
than one, it is possible for multiple DBNs to have the 
same fingerprint, as they can belong to different dupli- 
cate sequences. Therefore, all the content-nodes that rep- 
resent duplicates of a given fingerprint are added to an- 
other red-black tree, called the dup-tree (see Figure 4). 
This tree is rooted at the first content-node that maps to 
that fingerprint. There are advantages to organizing the 
duplicate content-nodes in a tree, as explained in the
iDedup algorithm section (Section 4.2).


4.1.3 DBN hash table 


This hash table indexes content-nodes by their DBNs. 
Its structure is similar to the fingerprint hash table with- 
out the dup-tree. It facilitates the deletion of content- 
nodes when the corresponding blocks are removed from 
the system. During deletion, blocks can only be identi- 
fied by their DBNs (otherwise the data must be read and 
hashed). The DBN is used to locate the corresponding 
content-node and delete it from all dedup-metadata. 


4.1.4 Reference count file 


Figure 5: Identification of sequences. This figure shows how
sequences are identified.


The refcount file stores the reference counts of all dedu-
plicated blocks on disk. It is ordered by DBN and main-
tains a 32-bit counter per block. When a block is deleted,
its entry in the refcount file is decremented. When the
reference count reaches zero, the block’s content-node
is removed from all dedup-metadata and the block is
marked free in the file system’s metadata. The refcount
file is also updated when a block is written. By dedu-
plicating sequential blocks, we observe that refcount
updates are often collocated to the same disk blocks,
thereby amortizing IOs to the refcount file.
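
To see why sequential deduplication tends to collocate refcount updates, consider the following Python sketch. It assumes the counters are packed densely in DBN order with no header and that the refcount file uses 4 KB blocks; the text states neither detail, and the function name and DBN values are hypothetical.

```python
COUNTER_BYTES = 4   # one 32-bit counter per block, as stated in the text
BLOCK_BYTES = 4096  # assumed 4 KB file system block


def refcount_location(dbn):
    """Map a DBN to (refcount-file block index, byte offset within that block)."""
    byte_offset = dbn * COUNTER_BYTES
    return byte_offset // BLOCK_BYTES, byte_offset % BLOCK_BYTES


# A deduplicated run of eight consecutive DBNs (hypothetical values).
run = range(2048, 2056)
touched = {refcount_location(dbn)[0] for dbn in run}
print(touched)  # {2}
```

Under these assumptions, 1024 counters share one refcount block, so a run of consecutive DBNs usually updates a single block of the refcount file.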


4.2 iDedup algorithm


For each file, the iDedup algorithm has three phases:


1. Sequence identification: Identify duplicate block se-
quences for file blocks.

2. Sequence pruning: Process duplicate sequences
based on their length.

3. Sequence deduplication: Deduplicate sequences
greater than the configured threshold.


We examine these phases next. 


4.2.1 Sequence identification 


In this phase, a set of newly written blocks, for a particu- 
lar file, are processed. We use the breadth-first approach 
for determining duplicate sequences. We start by scan- 
ning the blocks in order and utilize the fingerprint hash 
table to identify any duplicates for these blocks. We fil- 
ter the blocks to pick only data blocks that are complete 
(i.e., of size 4 KB) and that do not belong to special or
system files (e.g., the refcount file). During this pass, we 
also compute the MD5 hash for each block. 

Figure 6: Sequence identification example. Sequence identi-
fication for blocks with multiple duplicates. D1 represents the
dup-tree for block B(n) and D2 the dup-tree for B(n+1).


In Figure 5, the blocks B(n) (n = 1, 2, 3, ...) and the
corresponding fingerprints H(n) (n = 1, 2, 3, ...) are shown.
Here, n represents the block’s offset within the file (the
file block number or FBN). The minimum length of a du-
plicate sequence is two; so, we examine blocks in pairs;
i.e., B(1) and B(2) first, B(2) and B(3) next and so on.
For each pair, e.g., B(n) and B(n+1) (see Figure 5), we
perform a lookup in the fingerprint hash table for H(n)
and H(n+1). If neither of them is a match, we allocate the
blocks on disk normally and move to the next pair. When
we find a match, the matching content-nodes may have
more than one duplicate (i.e., a dup-tree) or just a single
duplicate (i.e., just a single DBN). Accordingly, to de-
termine if a sequence exists across the pair, we have one
of four conditions. They are listed below in increasing
degrees of difficulty; they are also illustrated in Figure 5.


1. Both H(n) and H(n+1) match a single content-node: 
Simplest case, if the DBN of H(n) is b, and DBN of 
H(n+1) is (b+1), then we have a sequence. 

2. H(n) matches a single content-node, H(n+1) matches 
a dup-tree content-node: If the DBN of H(n) is b; 
search for (b+1) in the dup-tree of H(n+1). 

3. H(n) matches a dup-tree, H(n+1) matches a single 
content-node: Similar to the previous case with H(n) 
and H(n+1) swapped. 

4. Both H(n) and H(n+1) match dup-tree content-nodes: 
This case is the most complex and can lead to multi- 
ple sequences. It is discussed in greater detail below. 


When both H(n) and H(n+1) match entries with dup- 
trees, we need to identify all possible sequences that 
can start from these two blocks. The optimized red- 
black tree used for the dup-trees has a search primitive, 
nsearch(x), that returns ‘x’ if ‘x’ is found; or the next 
largest number after ‘x’; or error if ‘x’ is already the
largest number. The cost of nsearch is the same as that 
of a regular tree search (O(log N)). We use this primitive 
to quickly search the dup-trees for all possible sequences. 
This is illustrated via an example in Figure 6. 


In our example, we show the dup-trees as two sorted
lists of DBNs. First, we compute the minimum and max-
imum overlapping DBNs between the dup-trees (i.e., 10
and 68 in the figure); all sequences will be within this
range. We start with 10, since this is in D1, the dup-
tree of H(n). We then perform an nsearch(11) in D2,
the dup-tree of H(n+1), which successfully leads to a se-
quence. Since the numbers are ordered, we perform an
nsearch(12) in D2 to find the next largest potential se-
quence number; the result is 38. Next, to pair with 38,
we perform nsearch(37) in D1. However, it results in
67 (not a sequence). Similarly, since we obtained 67 in
D1, we perform nsearch(68) in D2, thus yielding an-
other sequence. In this fashion, with a minimal number
of searches using the nsearch primitive, we are able to
glean all possible sequences between the two blocks.
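
The walk through the two dup-trees can be sketched as follows. This is a minimal Python sketch, not the paper's implementation: sorted lists searched with `bisect` stand in for the red-black dup-trees, `nsearch` returns `None` where the text says it reports an error, and the initial range computation is omitted. The contents of `d1` and `d2` are hypothetical values chosen to be consistent with the searches described above; Figure 6 may contain other DBNs.

```python
from bisect import bisect_left


def nsearch(sorted_dbns, x):
    """Return x if present, else the next larger DBN, else None."""
    i = bisect_left(sorted_dbns, x)
    return sorted_dbns[i] if i < len(sorted_dbns) else None


def find_sequences(d1, d2):
    """Return all (b, b + 1) pairs with b in d1 and b + 1 in d2."""
    pairs = []
    cur = d1[0] if d1 else None
    while cur is not None:
        nxt = nsearch(d2, cur + 1)
        if nxt is None:
            break
        if nxt == cur + 1:
            pairs.append((cur, nxt))
            # Look for the next candidate in D2 beyond this sequence.
            nxt = nsearch(d2, nxt + 1)
            if nxt is None:
                break
        # Jump in D1 to the first DBN that could pair with nxt.
        cur = nsearch(d1, nxt - 1)
    return pairs


d1 = [10, 67]      # hypothetical dup-tree of H(n)
d2 = [11, 38, 68]  # hypothetical dup-tree of H(n+1)
print(find_sequences(d1, d2))  # [(10, 11), (67, 68)]
```

Each iteration moves strictly forward in one of the two lists, so the number of searches is bounded by the combined number of DBNs in the two lists.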

It is necessary to efficiently record, manage, and iden- 
tify the sequences that are growing and those that have 
terminated. For each discovered sequence, we manage it 
via a sequence entry: the tuple (Last FBN of sequence, 
Sequence Size, Last DBN of sequence). Suppose, B(n) 
and B(n+1) have started a sequence with sequence entry 
S1. Upon examining B(n+1) and B(n+2), we find that
S1 grows and a new sequence, S2, is created. In such
a scenario, we want to quickly search for S1 and update 
its contents and create a new entry for S2. Therefore, 
we maintain the sequence entries in a hash table indexed 
by a combination of the tuple fields. In addition, as we 
process the blocks, to quickly determine terminated se- 
quences, we keep two lists of sequence entries: one for 
sequences that include the current block and another for 
sequences of the previous block. After sequence identi- 
fication for a block completes, if a sequence entry is not 
in the current block’s list, then it has terminated. 
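
The bookkeeping described above can be sketched in Python as follows. This is a simplified illustration under our own assumptions: the input `candidates_per_block` (the matching DBNs for each file block) is hypothetical, and a dictionary keyed by the last DBN replaces the paper's hash table indexed by a combination of tuple fields. The `previous` and `current` dictionaries play the role of the two lists of sequence entries.

```python
def track_sequences(candidates_per_block):
    """Breadth-first tracking of duplicate sequences for one file.

    candidates_per_block[fbn] is the collection of DBNs whose fingerprint
    matches file block `fbn`. Returns terminated sequences of two or more
    blocks as (last FBN, size, last DBN) tuples.
    """
    terminated = []
    previous = {}  # last DBN -> size, for sequences ending at the previous block
    for fbn, candidates in enumerate(candidates_per_block):
        current = {}
        for dbn in sorted(candidates):
            # Extend the sequence that ended at dbn - 1, or start a new one.
            current[dbn] = previous.get(dbn - 1, 0) + 1
        for last_dbn, size in previous.items():
            # A sequence not continued by the current block has terminated.
            if last_dbn + 1 not in current and size >= 2:
                terminated.append((fbn - 1, size, last_dbn))
        previous = current
    last_fbn = len(candidates_per_block) - 1
    for last_dbn, size in previous.items():
        if size >= 2:
            terminated.append((last_fbn, size, last_dbn))
    return terminated


# Hypothetical input: S1 is DBNs 10-12 (FBNs 0-2), S2 is DBNs 40-42 (FBNs 1-3).
print(track_sequences([{10}, {11, 40}, {12, 41}, {42}]))
# [(2, 3, 12), (3, 3, 42)]
```

In the example, S1 terminates when the fourth block offers no DBN 13, while S2 is still growing and is only reported at the end of the file.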


4.2.2 Sequence pruning 


Once we determine the sequences that have terminated, 
we process them according to their sizes. If a sequence 
is larger than the threshold, we check for overlapping 
blocks with non-terminated sequences using the heuris- 
tic mentioned in Section 3.4.2, and only deduplicate the 
non-overlapped blocks if they form a sequence greater 
than the threshold. For sequences shorter than the thresh- 
old, the non-overlapped blocks are allocated by assigning 
them to new blocks on disk. 


4.2.3 Deduplication of blocks


For each deduplicated block, the file’s metadata is up- 
dated with the original DBN at the appropriate FBN lo- 
cation. The appropriate block in the refcount file is re- 
trieved (a potential disk IO) and the reference count of 
the original DBN is incremented. We expect the refcount 
updates to be amortized across the deduplication of mul- 
tiple blocks for long sequences. 


5 Experimental evaluation 


In this section, we describe the goals of our evaluation 
followed by details and results of our experiments. 


5.1 Evaluation objectives 


Our goal is to show that a reasonable tradeoff exists be- 
tween performance and deduplication ratio that can be 
exploited by iDedup for latency sensitive, primary work- 
loads. In our system, the two major tunable parameters 
are: i) the minimum duplicate sequence threshold, and ii)
the in-memory dedup-metadata cache size. Using these
parameters we evaluate the system by replaying traces
from two real-world, enterprise workloads to examine: 


1. Deduplication ratio vs. threshold: We expect a drop 
in deduplication rate as threshold increases. 

2. Disk fragmentation profile vs. threshold: We expect 
the fragmentation to decrease as threshold increases. 

3. Client read response time vs. threshold: We expect 
the client read response time characteristics to follow 
the disk fragmentation profile. 

4. System CPU utilization vs. threshold: We expect the 
utilization to increase slightly with the threshold. 

5. Buffer cache hit rate vs. dedup-metadata cache size: 
We expect the buffer cache hit ratio to decrease as the 
metadata cache size increases. 


We describe these experiments and their results next. 


5.2 Experimental setup 


All evaluation is done using a NetApp® FAS 3070 stor-
age system running Data ONTAP® 7.3 [27]. It consists
of: 8 GB RAM; 512 MB NVRAM; 2 dual-core 1.8 GHz
AMD CPUs; and 3 10K RPM 144 GB FC Seagate Chee-
tah 7 disk drives in a RAID-0 stripe. The trace replay
client has a 16-core, Intel® Xeon® 2.2 GHz CPU with
16 GB RAM and is connected by a 1 Gb/s network link.

We use two, real-world, CIFS traces obtained from 
a production, primary storage system that was collected 
and made available by NetApp [20]. One trace contains 
Corporate departments’ data (MS Office, MS Access, 
VM Images, etc.), called the Corporate trace; it con- 
tains 19,876,155 read requests (203.8 GB total read) and 
3,968,452 write requests (80.3 GB total written). The 
other contains Engineering departments’ data (user home 
dirs, source code, etc.), called the Engineering trace; it 
contains 23,818,465 read requests (192.1 GB total read) 
and 4,416,026 write requests (91.7 GB total written). 
Each trace represents ≈ 1.5 months of activity. They are
replayed without altering their data duplication patterns. 

We use three dedup-metadata cache sizes: 1 GB,
0.5 GB and 0.25 GB, which cache block mappings for ap-
proximately 100%, 50% and 25% of all blocks written
in the trace, respectively. For the threshold, we use refer-
ence values of 1, 2, 4, and 8. Larger thresholds yield
deduplication savings too small to be worthwhile.

Two key comparison points are used in our evaluation:


1. The Baseline values represent the system without the
iDedup algorithm enabled (i.e., no deduplication).

2. The Threshold-1 values represent the highest dedu-
plication ratio for a given metadata cache size. Since
a 1 GB cache caches all block mappings, Threshold-1
at 1 GB represents the maximum deduplication pos-
sible (with a 4 KB block size) and is equivalent to a
static offline technique.


Figure 7: Deduplication ratio vs. Threshold. Deduplication
ratio versus threshold for the different cache sizes for Corporate
(top) and Engineering (bottom) traces.


5.3 Deduplication ratio vs. threshold


Figure 7 shows the tradeoff in deduplication ratio
(dedup-rate) versus threshold for both the workloads and
different dedup-metadata sizes. For both the workloads,
as the threshold increases, the number of duplicate se-
quences decreases and, correspondingly, the dedup-rate drops;
there is a 50% decrease between Threshold-1 (24%) and
Threshold-8 (13%), with a 1 GB cache. Our goal is to maxi-
mize the size of the threshold, while also maintaining
a high dedup-rate. To evaluate this tradeoff, we look
for a range of useful thresholds (> 1) where the drop in
dedup-rate is not too steep; e.g., the dedup-rates between
Threshold-2 and Threshold-4 are fairly flat. To minimize
performance impact, we would pick the largest threshold
that shows the smallest loss in dedup-rate: Threshold-4
from either graph. Moreover, we notice the drop in
dedup-rate from Threshold-2 to Threshold-4 is the same for
0.5 GB and 0.25 GB (≈ 2%), showing a bigger percent-
age drop for smaller caches. For the Corporate work-
load, iDedup achieves a deduplication ratio between 66%
(at Threshold-4, 0.25 GB) and 74% (at Threshold-4,
1 GB) of the maximum possible (≈ 24% at Threshold-1,
1 GB). Similarly, with the Engineering workload, we
achieve between 54% (at Threshold-4, 0.25 GB) and
62% (at Threshold-4, 1 GB) of the maximum (≈ 23%
at Threshold-1, 1 GB).


Figure 8: Disk fragmentation profile. CDF of number of se-
quential blocks in disk read requests for the Corporate (top)
and Engineering (bottom) traces with a 1 GB cache.


5.4 Disk fragmentation profile 


To assess disk fragmentation due to deduplication, we 
gather the number of sequential blocks (request size) for 
each disk read request across all the disks and plot them 
as a CDF (cumulative distribution function). All CDFs 
are based on the average over three runs. Figure 8 shows 
the CDFs for both Corporate and Engineering workloads 
for a dedup-metadata cache of 1 GB. Other cache sizes 
show similar patterns. Since the request stream is the
same for all thresholds, the difference in disk IO sizes, 
across the different thresholds, reflects the fragmentation 
of the file system’s disk layout. 

As expected, in both the CDFs, the Baseline shows the 
highest percentage of longer request sizes or sequential-
ity; i.e., the least fragmentation. Also, it can be observed
that the Threshold-1 line shows the highest amount of
fragmentation. For example, there is an 11% increase in
the number of requests smaller than or equal to 8, between
the Baseline and Threshold-1 for the Corporate workload 
and 12% for the Engineering workload. All the remain- 
ing thresholds (2, 4, 6, 8) show progressively less frag- 
mentation, and have CDFs between the Baseline and the 
Threshold-1 line; e.g., a 2% difference between Baseline 
and Threshold-8 for the Corporate workload. Hence, to 
optimally choose a threshold, we suggest the tradeoff is 
made after empirically deriving the dedup-rate graph and 
the fragmentation profile. In the future, we envision en- 
abling the system to automatically make this tradeoff. 


5.5 Client response time behavior 


Figure 9 (top graph) shows a CDF of client response 
times taken from the trace replay tool for varying thresh- 
olds of the Corporate trace at 1 GB cache size. We use 
response time as a measure of latency. For thresholds of 
8 or larger, the behavior is almost identical to the Base- 
line (an average difference of 2% for Corporate and 4% 
for Engineering at Threshold-8), while Threshold-2 and
4 (not shown) fall in between. We expect the client re- 
sponse time to reflect the fragmentation profile. How- 
ever, the impact on client response time is lower due to 
the storage system’s effective read prefetching. 

As can be seen, there is a slowly shrinking gap 
between Threshold-1 and Baseline for larger response 
times (> 2 ms) comprising ≈ 10% of all requests. The
increase in latency of these requests is due to the frag- 
mentation effect and it affects the average response time. 
To quantify this better, we plot the difference between the 
two curves in the CDF (bottom graph of Figure 9) against 
the response time. The area under this curve shows the 
total contribution to latency due to the fragmentation ef- 
fect. We find that it adds 13% to the average latency and 
a similar amount to the total runtime of the workload, 
which is significant. The Engineering workload has a 
similar pattern, although the effect is smaller (1.8% for 
average latency and total runtime). 


5.6 System CPU utilization vs. threshold 


Figure 9: Client response time CDF. CDF of client response
times for Corporate with a 1 GB cache (top); we highlight
the region where the curves differ. The difference between the
Baseline and Threshold-1 CDFs (bottom).


We capture CPU utilization samples every 10 seconds
from all the cores and compute the CDF for these val-
ues. Figure 10 shows the CDFs for our workloads with a
1 GB dedup-metadata cache. We expect Threshold-8 to
consume more CPU because there are potentially more
outstanding, unterminated sequences leading to more se-
quence processing and management. As expected, com-
pared to the Baseline, the maximum difference in mean
CPU utilization occurs at Threshold-8, but is relatively
small: ≈ 2% for Corporate and ≈ 4% for Engineering.
However, the CDFs for the thresholds exhibit a longer
tail, implying a larger standard deviation compared to
the Baseline; this is evident in the Engineering case but
less so for Corporate. Nevertheless, given that the change
is small (< 5%), we feel that the iDedup algorithm has
little impact on the overall utilization. The results are
similar across cache sizes; we chose the maximal 1 GB
one, since that represents maximum work in sequence
processing for the iDedup algorithm.


5.7 Buffer cache hit ratio vs. metadata size 


Figure 10: CPU Utilization CDF. CDF across all the cores for
varying thresholds for Corporate (top) and Engineering (bot-
tom) workloads with a 1 GB cache. Threshold-2 is omitted,
since it almost fully overlaps Threshold-4.


We observed the buffer cache hit ratio for different sizes
of the dedup-metadata cache. The size of the dedup-
metadata cache (and threshold) had no observable ef-
fect on the buffer cache hit ratio for two reasons: i) the
dedup-metadata cache size (max of 1 GB) is relatively
small compared to the total memory (8 GB); and ii) the
workloads’ working sets fit within the buffer cache. The
buffer cache hit ratio was steady for the Corporate (93%)
and Engineering (96%) workloads. However, workloads
with working sets that do not fit in the buffer cache would
be impacted by the dedup-metadata cache.


6 Related work 


Data storage efficiency can be realized via various com-
plementary techniques such as thin-provisioning (not all
of the storage is provisioned up front), data deduplica-
tion, and compression. As shown in Table 1 and as de-
scribed in Section 2, deduplication systems can be clas-
sified as primary or secondary (backup/archival). Pri-
mary storage is usually optimized for IOPS and latency,
whereas secondary storage systems are optimized for
throughput. These systems either process duplicates in-
line, at ingest time, or offline, during idle time.

Another key trade-off is with respect to the deduplica-
tion granularity. In file level deduplication (e.g., [18, 21,
40]), the potential gains are limited compared to dedupli- 
cation at block level. Likewise, there are algorithms for 
fixed-sized block or variable-sized (e.g., [4, 23]) block 
deduplication. Finally, there are content addressable sys- 
tems (CAS) that reference the object or block directly by 
its content hash; inherently deduplicating them [24, 31]. 

Although we are unaware of any prior primary, inline
deduplication systems, offline systems do exist. Some 
are block-based [1, 16], while others are file-based [11]. 

Complementary research has been done on inline
compression for primary data [6, 22, 38]. Burrows et
al. [5] describe an on-line compression technique for pri-
mary storage using a log-structured file system. In addi-
tion, offline compression products also exist [29].

The goals for inline secondary or backup deduplica- 
tion systems are to provide high throughput and high 
deduplication ratio. Therefore, to reduce the amount 
of in-memory dedup-metadata footprint and the number 
of metadata IOs, various optimizations have been pro- 
posed [2, 15, 21, 39, 41]. Another inline technique, by 
Lillibridge et al. [21], leverages temporal locality with 
sampling to reduce dedup metadata size in the context of 
backup streams. 

Deduplication systems have also leveraged flash stor- 
age to minimize the cost of metadata IOs [7, 25]. Clus- 
tered backup storage systems have been proposed for 
large datasets that cannot be managed by a single backup 
storage node [8]. 


7 Conclusion 


In this paper, we describe iDedup, an inline deduplica- 
tion system specifically targeting latency-sensitive, pri- 
mary storage workloads. With latency sensitive work- 
loads, inline deduplication has many challenges: frag- 
mentation leading to extra disk seeks for reads, dedupli- 
cation processing overheads in the critical path, and extra 
latency caused by IOs for dedup-metadata management. 

To counter these challenges, we derived two insights
by observing real-world, primary workloads: i) there is
significant spatial locality on disk for duplicated data,
and ii) temporal locality exists in the accesses of dupli-
cated blocks. First, we leverage spatial locality to per-
form deduplication only when the duplicate blocks form
long sequences on disk, thereby avoiding fragmentation.
Second, we leverage temporal locality by maintaining
dedup-metadata in an in-memory cache to avoid extra
IOs. From our evaluation, we see that iDedup offers sig-
nificant deduplication with minimal resource overheads
(CPU and memory). Furthermore, with careful threshold
selection, a good compromise between performance and
deduplication can be reached, thereby making iDedup
well suited to latency sensitive workloads.


Abstract


Hybrid storage solutions use NAND flash memory based 
Solid State Drives (SSDs) as non-volatile cache and tra- 
ditional Hard Disk Drives (HDDs) as lower level stor- 
age. Unlike a typical cache, internally, the flash memory 
cache is divided into cache space and over-provisioned 
space, used for garbage collection. We show that bal- 
ancing the two spaces appropriately helps improve the 
performance of hybrid storage systems. We show that 
contrary to expectations, the cache need not be filled with 
data to the fullest, but may be better served by reserving 
space for garbage collection. For this balancing act, we 
present a dynamic scheme that further divides the cache 
space into read and write caches and manages the three 
spaces according to the workload characteristics for op- 
timal performance. Experimental results show that our 
dynamic scheme improves performance of hybrid stor- 
age solutions up to the off-line optimal performance of a 
fixed partitioning scheme. Furthermore, as our scheme 
makes efficient use of the flash memory cache, it re- 
duces the number of erase operations thereby extending 
the lifetime of SSDs. 


1 Introduction 


Conventional Hard Disk Drives (HDDs) and state-of-the- 
art Solid State Drives (SSDs) each has strengths and lim- 
itations in terms of latency, cost, and lifetime. To alle- 
viate limitations and combine their advantages, hybrid 
storage solutions that combine HDDs and SSDs are now 
available for purchase. For example, a hybrid disk that 
comprises the conventional magnetic disk with NAND 
flash memory cache is commercially available [30]. We 
consider hybrid storage that uses NAND flash memory 
based SSDs as a non-volatile cache and traditional HDDs 
as lower level storage. Specifically, we tackle the issue 
of managing the flash memory cache in hybrid storage. 
The ultimate goal of hybrid storage solutions is pro-
viding SSD-like performance at HDD-like price, and
achieving this goal requires near-optimal management
of the flash memory cache. Unlike a typical cache, the
flash memory cache is unique in that SSDs require over-
provisioned space (OPS) in addition to the space for nor-
mal data. To make a clear distinction between OPS and
space for normal data, we refer to the space in flash mem-
ory cache used to keep normal data as the caching space.


Figure 1: Balancing data in cache and update cost for
optimal performance


The OPS is used for garbage collection operations per- 
formed during data updates. It is well accepted that given 
a fixed capacity SSD, increasing the OPS size brings 
about two consequences [11, 15, 26]. First, it reduces 
the caching space resulting in a smaller data cache. Less 
data caching results in decreased overall flash memory 
cache performance. Note Figure 1 (not to scale), where
the x-axis represents the OPS size and the y-axis repre- 
sents the performance of the flash memory cache. The 
dotted line with triangle marks shows that as the OPS 
size increases, caching space decreases and performance 
degrades. 


In contrast, with a larger OPS, the update cost of data 
in the cache decreases and, consequently, performance 
of the flash memory cache improves. This is represented 
as the square marked dotted line in Figure 1. Note that 
as the two dotted lines cross, there exists a point where 
performance of the flash memory cache is optimal. The
goal of this paper is to find this optimal point and use it 
in managing the flash memory cache. 

To reiterate, the main contribution of this paper is in 
presenting a dynamic scheme that finds the workload de- 
pendent optimal OPS size of a given flash memory cache 
such that the performance of the hybrid storage system 
is optimized. Specifically, we propose cost models that 
are used to determine the optimal caching space and OPS 
sizes for a given workload. In our solution, the caching 
space is further divided into read and write caches, and 
we use cost models to dynamically adjust the sizes of the 
three spaces, that is, the read cache, write cache, and the 
OPS according to the workload for optimal hybrid stor- 
age performance. These cost models form the basis of 
the Optimal Partitioning Flash Cache Layer (OP-FCL) 
flash memory cache management scheme that we pro- 
pose. 

Experiments performed on a DiskSim-based hybrid
storage system using various realistic server workloads
show that OP-FCL performs comparably to the off-
line optimal fixed partitioning scheme. The results indi-
cate that caching as much data as possible is not the best
solution; rather, caching an appropriate amount to balance
the cache hit rate and the garbage collection cost is most
appropriate. That is, caching less data in the flash mem-
ory cache can bring about better performance as the gains
from reduced overhead for data update compensate for
losses from keeping less data in cache. Furthermore, our
results indicate that as our scheme makes efficient use
of the flash memory cache, OP-FCL can significantly re-
duce the number of erase operations in flash memory.
For our experiments, this results in the lifetime of SSDs
being extended by as much as three times compared to
conventional uses of SSDs.

The rest of the paper is organized as follows. In the 
next section, we discuss previous studies that are rele- 
vant to our work with an emphasis on the design of hy- 
brid storage systems. In Section 3, we start off with a 
brief review of the HDD cost model. Then, we move on 
and describe cost models for NAND flash memory stor- 
age. Then, in Section 4, we derive cost models for hy- 
brid storage and discuss the existence of optimal caching 
space and OPS division. We explain the implementation 
issues in Section 5 and then, present the experimental re- 
sults in Section 6. Finally, we conclude with a summary 
and directions for future work. 


2 Related Work 


Numerous hybrid storage solutions that integrate HDDs 
and SSDs have been suggested [8, 11, 14, 29]. Kgil et
al. propose splitting the flash memory cache into separate
read and write regions, taking into consideration
the fact that read and write costs are different in flash 
memory [11]. Chen et al. propose Hystor that integrates 
low-cost HDDs and high-speed SSDs [4]. To make bet- 
ter use of SSDs, Hystor identifies critical data, such as 
metadata, keeping them in SSDs. Also, it uses SSDs as 
a write-back buffer to achieve better write performance. 
Pritchett and Thottethodi observe that reference patterns 
are highly skewed and propose a highly-selective caching 
scheme for SSD cache [26]. These studies try to reduce 
expensive data allocation and write operations in flash 
memory storage as writes are much more expensive than 
reads. They are similar to ours in that flash memory stor- 
age is being used as a cache in hybrid storage solutions 
and that some of them split the flash memory cache into 
separate regions. However, our work is unique in that it 
takes into account the trade-off between caching benefit 
and data update cost as determined by the OPS size. 


The use of flash memory caches with other objectives
in mind has also been suggested. As SSDs have lower
energy consumption than HDDs, Lee et al. propose an 
SSD-based cache to save energy of RAID systems [18]. 
In this study, an SSD is used to keep recently referenced 
data as well as for write buffering. Similarly, to save en- 
ergy, Chen et al. suggest a flash memory based cache 
for caching and prefetching data of HDDs [3]. Saxena 
et al. use flash memory as a paging device for the vir- 
tual memory subsystem [28] and Debnath et al. use it 
as a metadata store for their de-duplication system [5]. 
Combining SSDs and HDDs in the opposite direction has 
also been proposed. A serious concern of flash mem- 
ory storage is its relatively short lifetime and, to extend 
SSD lifetime, Soundararajan et al. suggest a hybrid stor- 
age system called Griffin, which uses HDDs as a write 
cache [32]. Specifically, they use a log-structured HDD 
cache, periodically destaging data to SSDs so as to re- 
duce write requests and, consequently, to increase the 
lifetime of SSDs. 


There have been studies that concentrate on finding 
cost-effective ways to employ SSDs in systems. To sat- 
isfy high-performance requirements at a reasonable cost 
budget, Narayanan et al. look into whether replacing
disk-based storage with SSDs may be cost effective; they
conclude that such a replacement is not yet cost effective [22].
Kim et al. suggest a hybrid system called HybridStore 
that combines both SSDs and HDDs [15]. The goal of 
this study is in finding the most cost-effective configura- 
tion of SSDs and HDDs. 


Besides studies on flash memory caches, there are 
many buffer cache management schemes that use the 
idea of splitting caching space. Kim et al. present a 
buffer management scheme called Unified Buffer Management
(UBM) that detects sequential and looping references
and stores those blocks in separate regions in the
buffer cache [13]. Park et al. propose CRAW-C (Clock 
for Read And Write considering Compressed file system) 
that allocates three memory areas for read, write, and 
compressed pages, respectively [24]. Shim et al. suggest 
an adaptive partitioning scheme for the DRAM buffer in 
SSDs. This scheme divides the DRAM buffer into the 
caching and mapping spaces, dynamically adjusting their 
sizes according to the workload characteristics [31]. This 
study is different from ours in that the notion of OPS is 
necessary for flash memory updates, while for DRAM, it 
is not. 


3 Flash Memory Cache Cost Model 


In this section, we present the cost models for SSDs
and HDDs [35]. HDD reads and writes are characterized
by seek time and rotational delay. Assume that
$C_{D\_RPOS}$ and $C_{D\_WPOS}$ are the sums of the average
seek time and the average rotational delay for HDD reads
and writes, respectively. Let us also assume that $P$ is the
data size in bytes and $B$ is the bandwidth of the disk.
Then, the data read and write costs of an HDD are derived
as $C_{DR} = C_{D\_RPOS} + P/B$ and $C_{DW} = C_{D\_WPOS} + P/B$,
respectively. (For detailed derivations, see Wang [35].)
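
As a quick illustration of the HDD cost model above, the following minimal Python sketch evaluates the positioning-plus-transfer formula. The function name `hdd_access_cost` is hypothetical, and the numbers are taken from the simulator configuration described in Section 5, under the assumption that the quoted average latencies can stand in for the positioning terms.

```python
def hdd_access_cost(positioning_s: float, size_bytes: int, bandwidth_bps: float) -> float:
    """Return positioning time plus transfer time (P / B), in seconds."""
    return positioning_s + size_bytes / bandwidth_bps


# Assumed example values: 4.4 ms / 4.9 ms positioning, 72 MB/s, 4 KB request.
PAGE_BYTES = 4096
BANDWIDTH = 72e6

c_dr = hdd_access_cost(4.4e-3, PAGE_BYTES, BANDWIDTH)  # HDD read cost
c_dw = hdd_access_cost(4.9e-3, PAGE_BYTES, BANDWIDTH)  # HDD write cost

print(f"C_DR = {c_dr * 1e3:.3f} ms, C_DW = {c_dw * 1e3:.3f} ms")
```

For small requests the positioning term dominates; the transfer term for a 4KB request is only a few tens of microseconds.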

Before moving on to the cost model of flash mem- 
ory based SSDs, we give a short review of NAND flash 
memory and the workings of SSDs. NAND flash mem- 
ory, which is the storage medium of SSDs, consists of a 
number of blocks and each block consists of a number 
of pages. Reads are done in page units and take con- 
stant time. Writes are also done in page units, but data 
can be written to a page only after the block contain- 
ing the page becomes clean, that is, after it is erased. 
This is called the erase-before-write property. Due to 
this property, data update is usually done by relocating 
new data to a clean page of an already erased block 
and most flash memory storage devices employ a so- 
phisticated software layer called the Flash Translation 
Layer (FTL) that relocates modified data to new loca- 
tions. The FTL also provides the same HDD interface 
to SSD users. Various FTLs such as page mapping 
FTL [7, 34], block mapping FTL [12], and many hy- 
brid mapping FTLs [10, 17, 19, 23] have been proposed. 
Among them, the page mapping FTL is used in many 
high-end commercial SSDs that are used in hybrid stor- 
age solutions. Hence, in this paper, we focus on the page 
mapping FTL. However, the methodology that follows 
may be used with block and hybrid mapping FTLs as 
well. The key difference would be in deriving garbage 
collection and page write cost models appropriate for 
these FTLs. 

Figure 2: Garbage collection in flash memory storage

As previously mentioned, the FTL relocates modified
data to a clean page, and pages with old data become
invalid. The FTL recycles blocks with invalid pages by
performing garbage collection (GC) operations. For data
updates and subsequent GCs, the FTL must always preserve
some number of empty blocks. As data updates
consume empty blocks, the FTL must produce more
empty blocks by performing GCs that collect valid pages
scattered in used blocks to an empty block, marking the
used blocks as new empty blocks. The worst case and
average GC costs are determined by the ratio of the initial
OPS to the total storage space. It has been shown
that the worst case and average GC costs become lower
as more over-provisioned blocks are reserved [9].


If we assume that the FTL selects the block with the
minimum number of valid pages for a GC operation,
then the worst case GC occurs when all valid (or invalid)
pages are evenly distributed to all flash memory blocks
except for an empty block that is preserved for GC
operations. For now, let us assume that u is the worst case
utilization determined from the initial number of
over-provisioned blocks and data blocks. Then, in Fig. 2(a),
where there are 3 data blocks containing cached data and
4 initial over-provisioned blocks, the worst case u is
calculated as $3/(3+4-1)$. (We subtract 1 because the FTL
must preserve one empty block for GC as marked by the
arrow in Fig. 2(b).) From u, the maximum number of
valid pages in the block selected for GC can be derived
as $\lceil u \cdot N_P \rceil$, where $N_P$ is the number of pages in a block.
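
The arithmetic of the Fig. 2(a) example can be checked with a short Python sketch. The helper names are hypothetical, and the block size of 64 pages is borrowed from the configuration mentioned in Section 5 rather than from the figure.

```python
import math


def worst_case_utilization(data_blocks: int, ops_blocks: int) -> float:
    """u = data / (data + OPS - 1); one empty block is reserved for GC."""
    return data_blocks / (data_blocks + ops_blocks - 1)


def max_valid_pages(u: float, pages_per_block: int) -> int:
    """Upper bound on valid pages in the GC victim: ceil(u * N_P)."""
    return math.ceil(u * pages_per_block)


u = worst_case_utilization(data_blocks=3, ops_blocks=4)
print(u, max_valid_pages(u, pages_per_block=64))
```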


Then, the worst case GC cost for a given utilization u
can be calculated from the following equation, where $N_P$
is the number of pages in a block, $C_E$ is the erase cost
(time) of a flash memory block, and $C_{CP}$ is the page copy
cost (time). (We assume that the copyback operation is
being used. For flash memory chips that do not support
copyback, $C_{CP}$ may be expanded to a sequence of read,
$C_{PR}$, and write, $C_{PROG}$, operations.)


$$C_{GC}(u) = \lceil u \cdot N_P \rceil \cdot C_{CP} + C_E \tag{1}$$


That is, as seen in Fig. 2(b) and (c), a GC operation
erases an empty block with cost $C_E$ and copies all
valid pages from the block selected for GC to the erased
empty block with cost $\lceil u \cdot N_P \rceil \cdot C_{CP}$. Then, the
garbage-collected block becomes an empty block that may be
used for the next GC. The remaining clean pages in the
previously empty block are used for subsequent write
requests. If all those clean pages are consumed, then
another GC operation will be performed.

After GC, in the worst case, there are $\lfloor (1-u) \cdot N_P \rfloor$
clean pages in what was previously an empty block (for
example, the right-most block in Fig. 2(c)) and write
requests of that number can be served in the block. Let
us assume that $C_{PROG}$ is the page program time (cost)
of flash memory. (Note that "page program" and "page
write" are used interchangeably in the paper.) By dividing
the GC cost and adding it to each write request, we can
derive $C_{PW}(u)$, the page write cost for worst case
utilization u, as follows.


$$C_{PW}(u) = \frac{C_{GC}(u)}{\lfloor (1-u) \cdot N_P \rfloor} + C_{PROG} \tag{2}$$
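
To make Equations 1 and 2 concrete, here is a minimal Python sketch of the worst case GC cost and the amortized page write cost. The function names are hypothetical. The timing values are assumptions: erase and program times follow the simulator settings given in Section 5, and the copy cost is taken as one page read plus one page program, as the text suggests for chips without copyback.

```python
import math


def gc_cost(u, pages_per_block, copy_cost, erase_cost):
    """Eq. 1: copy ceil(u * N_P) valid pages, then erase one block."""
    return math.ceil(u * pages_per_block) * copy_cost + erase_cost


def page_write_cost(u, pages_per_block, copy_cost, erase_cost, program_cost):
    """Eq. 2: GC cost amortized over the clean pages it frees, plus one program."""
    clean_pages = math.floor((1 - u) * pages_per_block)
    if clean_pages <= 0:
        raise ValueError("utilization too high: GC frees no clean pages")
    gc = gc_cost(u, pages_per_block, copy_cost, erase_cost)
    return gc / clean_pages + program_cost


# Assumed parameters (seconds): 64 pages/block, 225 us copy, 1.5 ms erase, 200 us program.
params = dict(pages_per_block=64, copy_cost=225e-6, erase_cost=1.5e-3, program_cost=200e-6)

for u in (0.5, 0.7, 0.9):
    print(f"u = {u:.1f}: C_PW = {page_write_cost(u, **params) * 1e6:.1f} us")
```

The write cost grows sharply as u approaches 1, because each GC must copy more pages while freeing fewer clean ones.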


Equation 2 is the worst case page update cost of flash
memory storage assuming valid data (or invalid data)
are evenly distributed among all the blocks. Typically,
however, the number of valid pages in a block will
vary. For example, the block marked "Victim for GC"
in Fig. 2(b) has a smaller number of valid pages than the
other blocks. Therefore, in cases where the FTL selects a
block with a small number of valid pages for the GC
operation, the utilization of the garbage-collected block,
u', would be lower than the worst case utilization, u.
Previous LFS and flash memory studies derived and used the
following relation between u' and u [17, 20, 35].


$$u = \frac{u' - 1}{\ln u'}$$


Let U(u) be the function that translates u to u'. (In
our implementation, we use a table that translates u to
u'.) Then the average page update cost can be derived
by applying U(u) for u in Equations 1 and 2, leading to
Equations 3 and 4.


$$C_{GC}(u) = U(u) \cdot N_P \cdot C_{CP} + C_E \tag{3}$$

$$C_{PW}(u) = \frac{C_{GC}(u)}{(1 - U(u)) \cdot N_P} + C_{PROG} \tag{4}$$

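
One way such a translation table for U(u) could be built is by numerically inverting the relation between u and u'. The sketch below assumes the relation u = (u' - 1)/ln u' and uses bisection, which is a choice made here for illustration and not necessarily how the table in the implementation was produced. The function names are hypothetical.

```python
import math


def worst_case_from_average(u_prime: float) -> float:
    """Forward relation: u = (u' - 1) / ln(u'), for 0 < u' < 1."""
    return (u_prime - 1.0) / math.log(u_prime)


def average_utilization(u: float, tol: float = 1e-9) -> float:
    """U(u): solve the forward relation for u' by bisection.

    The forward relation increases monotonically on (0, 1), so bisection
    converges. Very small u (below roughly 0.04) clamps to the lower bound.
    """
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if worst_case_from_average(mid) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# A coarse lookup table in 5% steps of u.
table = {k / 20: average_utilization(k / 20) for k in range(1, 20)}
for u, u_prime in table.items():
    print(f"u = {u:.2f} -> u' = {u_prime:.3f}")
```

The table confirms the qualitative point of the text: for any u below 1, the utilization of the victim block u' is lower than u.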
4 Hybrid Storage Cost Model


In the previous section, the garbage collection and page 
update cost of flash memory storage was derived. In 
this section, we derive the cost models for hybrid stor- 
age systems, which consist of a flash memory cache and 
a HDD. Specifically, the cost models determine the op- 
timal size of the caching space and OPS minimizing the 
overall data access cost of the hybrid storage system. In 
our derivation of the cost models, we first derive the read 
cache cost model and then, derive the read/write cache 
cost model used to determine the read cache size, write 
cache size and OPS size. Our models assume that the 
cache management layer can measure the hit and miss 
rates of read/write caches as well as the number of I/O 
requests. These values can be easily measured in real 
environments. 


4.1 Read cache cost model 


On a read request the storage examines whether the re- 
quested data is in the flash memory cache. If it is, the 
storage reads it and transfers it to the host system. If it 
is not in the cache, the system reads it from the HDD, 
stores it in the flash memory cache and transfers it to the 
host system. If the flash memory cache is already full 
with data (as will be the case in steady state), it must in- 
validate the least valuable data in the cache to make room 
for the new data. We use the LRU (Least Recently Used) 
replacement policy to select the least valuable data. In 
the case of read caching, the selected data need only be 
invalidated, which can be done essentially for free. (We 
discuss the issue of accommodating other replacement 
policies in Section 5.) 

Let us assume that $H_R(u)$ is the cache read hit rate for a
given cache size, which is determined by the worst case
utilization u, as we will see later. With rate $H_R(u)$, the
system reads the requested data from the cache with cost
$C_{PR}$, the page read operation cost (time) of flash memory,
and transfers it to the host system. With rate $1 - H_R(u)$,
the system reads data from disk with cost $C_{DR}$ and, after
invalidating the least valuable data selected by the cache
replacement policy, stores it in the flash memory cache
with cost $C_{PW}(u)$, which is the cost of writing new data
to cache including the possible garbage collection cost.
Then, $C_{HR}$, the read cost of the hybrid storage system
with a read cache, is as follows.


$$C_{HR}(u) = H_R(u) \cdot C_{PR} + (1 - H_R(u)) \cdot (C_{DR} + C_{PW}(u)) \tag{5}$$

Let us now take the flash memory cache size into
consideration. For a given flash memory cache size, $S_F$,
the read cache size, $S_R$, and the OPS size, $S_{OPS}$, can be
approximated from u such that $S_{OPS} \approx (1-u) \cdot S_F$ and
$S_R \approx u \cdot S_F$. These sizes are approximate values as they
do not take into account the empty block reserved for
GC. (Recall the empty block in Fig. 2.) Though calculating
the exact size is possible by considering the empty
block, we choose to use these approximations as they
are simpler, and their influence is negligible relative to
the overall performance estimation.

Figure 3: (a) Read hit rate curve generated using the
numpy.random.zipf Python function (Zipfian distribution
with α = 1.2 and range = 120%) and (b) the hybrid storage
read cost graph for this particular hit rate curve, with
optimal point at 92%.

Let us now take an example. Assume that we have a hit
rate curve $H_R(u)$ for read requests as shown in Fig. 3(a),
where the x-axis is the cache size and the y-axis is the
hit rate. Then, with Equation 5, we can redraw the hit
rate curve with u on the x-axis, and consequently, the
access cost graph of the hybrid storage system becomes
Fig. 3(b). The graph shows that the overall access cost
becomes lower as u increases until u reaches 92%, where
the access cost becomes minimal. Beyond this point, the
access cost suddenly increases, because even though the
caching benefit is still high the data update cost soars as
the OPS shrinks. Once we find u with minimum cost, the
read cache size and OPS size can be found from
$S_{OPS} \approx (1-u) \cdot S_F$ and $S_R \approx u \cdot S_F$.
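
The search for the cost-minimizing u in the read-only case can be sketched in Python as follows. This is only an illustration: the hit rate curve `read_hit_rate` is a made-up concave function standing in for a measured curve, the device timings are assumptions based on the simulator settings in Section 5, and the worst case write cost of Equation 2 is used in place of the average cost, so the resulting optimum will not match the 92% of Fig. 3(b).

```python
import math

# Assumed costs in seconds.
C_PR, C_PROG, C_E = 25e-6, 200e-6, 1.5e-3   # flash page read, program, block erase
C_CP = C_PR + C_PROG                         # page copy without copyback (assumption)
C_DR = 4.4e-3 + 4096 / 72e6                  # HDD read of one 4 KB page (assumption)
N_P = 64                                     # pages per block


def page_write_cost(u: float) -> float:
    """Eq. 2 (worst case); infinite when GC would free no clean page."""
    clean_pages = math.floor((1 - u) * N_P)
    if clean_pages == 0:
        return math.inf
    return (math.ceil(u * N_P) * C_CP + C_E) / clean_pages + C_PROG


def read_hit_rate(u: float) -> float:
    """Hypothetical hit rate curve: rises quickly, then saturates."""
    return 0.95 * (1 - math.exp(-4 * u))


def hybrid_read_cost(u: float) -> float:
    """Eq. 5: hits are served from flash, misses from disk plus a cache fill."""
    h = read_hit_rate(u)
    return h * C_PR + (1 - h) * (C_DR + page_write_cost(u))


candidates = [k / 100 for k in range(1, 100)]
best_u = min(candidates, key=hybrid_read_cost)
print(f"best u = {best_u:.2f}, cost = {hybrid_read_cost(best_u) * 1e6:.1f} us")
```

Even with this synthetic curve, the cost first falls as the cache grows and then rises once the shrinking OPS makes each cache fill expensive, which is the trade-off Fig. 3(b) depicts.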


4.2 Read and write cache cost model 


Previous studies have shown that due to their difference 
in costs, separating read and write requests in flash mem- 
ory storage has a significant effect on performance [11]. 
Hence, we now incorporate write cost to the model by 
dividing the flash caching space into two areas, namely 
a write cache and a read cache. The read cache, whose 
cost model was derived in the previous subsection, con- 
tains data that has recently been read but never written 
back while the write cache keeps data that has recently 
been written, but not yet destaged. Therefore, data in the 
write cache are dirty and they must be written to the HDD 
when evicted from the cache. When a write is requested 
to data in the read cache, we regard it as a write miss. 
In this case, we invalidate the data in the read cache and 
write the new data in the write cache. We consider the
case of reading data in the write cache later.

In the following cost model derivation, we assume a
write-back policy for the write cache. This choice is
more efficient than the write-through policy without any 
loss in consistency as the flash cache is also non-volatile. 
If the write-through policy must be used, our model 
needs to be modified to reflect the additional write to 
HDD that is incurred for each write to the flash cache. 
This will result in a far less efficient hybrid storage sys- 
tem. 

There can be three types of requests to the flash write
cache. The first is a write hit request, which is a write
request to existing data in the write cache. In this case, the
old data becomes invalidated and the new data is written
to the write cache with cost $C_{PW}(u)$. The second
is a write miss request, which is a write request to data
that does not exist in the write cache. In this case, the
cache replacement policy selects victim data that should
be read from the write cache and destaged to the HDD
with cost $C_{PR} + C_{DW}$ to make room for the newly
requested data. (Note we are assuming the system is in steady
state.) After evicting the data, the hybrid storage system
writes the new data to the write cache with cost $C_{PW}(u)$.
The last type of request is a read hit request, which is a
read request to existing (and possibly dirty) data in the
write cache. In this case, the request can be satisfied
with cost $C_{PR}$, that is, the flash memory page read cost.
Note that there is no read miss request to the write cache
because read requests to data not in cache are handled by
the read cache.

Now we introduce a parameter r, which is the read
cache size ratio within the caching space, where
$0 \le r \le 1$. Naturally, $1 - r$ is the ratio of the write cache
size. If r is 1, all caching space is used as a read cache
and, if it is 0, all caching space is used as a write cache.
Let $S_C$ denote the total size of the caching space. Then,
we can calculate the read cache size, $S_R$, and write cache
size, $S_W$, from $S_C$ such that $S_R = S_C \cdot r$ and
$S_W = S_C \cdot (1 - r)$. Note that $S_C$ is calculated from u such
that $S_C \approx u \cdot S_F$. Then, $S_R$ and $S_W$ are determined by
u and r.

Let us assume that the cache management layer can
measure the read hit rates of the read cache and draw
$H_R(u,r)$, the read cache hit rate curve, which now has
two parameters u and r. (We will show that the hit rate
curve can be obtained by using ghost buffers in the next
section.) Then, the read cost of the hybrid storage system
is now modified as follows.


$$C_{HR}(u,r) = (1 - H_R(u,r)) \cdot (C_{DR} + C_{PW}(u)) + H_R(u,r) \cdot C_{PR}$$


Figure 4: (a) Read and (b) write hit rate curves generated
using the numpy.random.zipf Python function ((a)
Zipfian distribution with α = 1.2 and range = 120%, (b)
Zipfian distribution with α = 1.4 and range = 220%) and
(c) the hybrid storage access cost graph for these hit rate
curves.

Let us also assume that we can measure the write hit,
the write miss, and the read hit rates of the write cache
and draw the hit rate curves. For the moment, let us
regard the read hit in the write cache as being part of
the write hit. Assume that $H_W(u,r)$ is the write cache
hit rate for a given write cache size, and it has two
parameters that determine the cache size. Then, with
rate $H_W(u,r)$, a write request finds its data in the write
cache, and the cost of this action is $H_W(u,r) \cdot C_{PW}(u)$.
Otherwise, with rate $1 - H_W(u,r)$, the write request
does not find data in the write cache. Servicing this
request requires reading and evicting existing data and
writing new data to the write cache. Hence, the cost is
$(1 - H_W(u,r)) \cdot (C_{PR} + C_{DW} + C_{PW}(u))$. In summary, the
write cost of the hybrid storage system can be given as
follows.


$$C_{HW}(u,r) = (1 - H_W(u,r)) \cdot (C_{PR} + C_{DW} + C_{PW}(u)) + H_W(u,r) \cdot C_{PW}(u)$$


Now let us consider the read hit case within the write
cache. Although it is possible to maintain separate read
hit and write hit curves for the write cache, this makes the
cost model more complex without much benefit, especially
in terms of implementation. Therefore, we devise a
simple approximation method for incorporating the read
hit case in the write cache. Assume that h' is the read
hit rate in the write cache. (Then, naturally, $1 - h'$ is the
write hit rate in the write cache.) Then, with rate h', the
read hit is satisfied with cost $C_{PR}$ and with rate $1 - h'$,
the write hit is satisfied with cost $C_{PW}(u)$. Now we can
calculate the average cost for both read hit and write hit
such that $C_{WH} = (1 - h') \cdot C_{PW}(u) + h' \cdot C_{PR}$. By
assuming $H_W(u,r)$ is the hit rate including both read and
write hits, the write cost of the hybrid storage system now
can be given as follows.


$$C_{HW}(u,r) = (1 - H_W(u,r)) \cdot (C_{PR} + C_{DW} + C_{PW}(u)) + H_W(u,r) \cdot C_{WH}$$


Now, let $IO_R$ and $IO_W$, respectively, be the rates served
in the read and write caches among all requests. For
example, of a total of 100 requests, if 70 requests are served
in the read cache and 30 requests are served in the write
cache, then $IO_R$ is 0.7 and $IO_W$ is 0.3. Then we can
derive $C_{HY}(u,r)$, the overall access cost of the hybrid
storage system that has separate read and write caches and
OPS, as follows.


$$C_{HY}(u,r) = C_{HR}(u,r) \cdot IO_R + C_{HW}(u,r) \cdot IO_W \tag{6}$$
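
The pieces of the read/write model can be assembled into a single function, as in the following Python sketch. The function name and argument names are hypothetical; the hit rates and the page write cost are passed in as plain numbers, so the caller is assumed to have already evaluated them at the chosen u and r.

```python
def hybrid_access_cost(h_r, h_w, h_prime, io_r, io_w, c_pr, c_dr, c_dw, c_pw):
    """Eq. 6: overall access cost for one (u, r) configuration.

    h_r, h_w  -- read cache and write cache hit rates at this (u, r)
    h_prime   -- fraction of write cache hits that are reads (h')
    io_r/io_w -- fractions of requests served by the read and write caches
    c_pr      -- flash page read cost; c_dr/c_dw -- HDD read/write cost
    c_pw      -- flash page write cost C_PW(u), including amortized GC
    """
    # Read cache: hit costs a flash read, miss costs a disk read plus a cache fill.
    c_hr = (1 - h_r) * (c_dr + c_pw) + h_r * c_pr
    # Average cost of a hit in the write cache (write hits and read hits combined).
    c_wh = (1 - h_prime) * c_pw + h_prime * c_pr
    # Write cache: miss costs destaging (flash read + disk write) plus the new write.
    c_hw = (1 - h_w) * (c_pr + c_dw + c_pw) + h_w * c_wh
    return c_hr * io_r + c_hw * io_w


# Example with made-up unit costs to show the weighting.
print(hybrid_access_cost(h_r=0.0, h_w=0.0, h_prime=0.0, io_r=0.2, io_w=0.8,
                         c_pr=1.0, c_dr=10.0, c_dw=20.0, c_pw=3.0))
```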


Let us take an example. Assume that, at a certain time,
the hybrid storage system finds $IO_R$, $IO_W$, and h' to be 0.2,
0.8, and 0.2, respectively, and the read and write hit rate
curves are estimated as shown in Fig. 4(a) and (b). In the
graph, both read and write hit rates increase as caches
become larger but slowly saturate beyond some point. As
the read and write cache sizes are determined by u and r,
we can obtain the read and write cache hit rates for given
u and r values from the hit rate curves. Then, with the
cost model of Equation 6, we can draw the overall access
cost graph of the system as in Fig. 4(c). In the graph, the
x-axis is u and the y-axis is r. These two parameters
determine the read and write cache sizes as well as the OPS
size. In Fig. 4(c), darker shades reflect lower access cost
and we pinpoint the lowest access cost with the diamond
mark pointed to by the arrow.

Specifically, the minimum overall access cost of the 
hybrid storage system is when u is 0.64 and r is 0.25 for 
this particular configuration. For a 4GB flash memory 
cache, this translates to the read cache size of 0.64GB, 
the write cache size of 1.92GB, and an OPS size of 
1.44GB. 
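
The sizes quoted above follow directly from the approximations $S_C \approx u \cdot S_F$, $S_R = S_C \cdot r$, $S_W = S_C \cdot (1-r)$, and $S_{OPS} \approx (1-u) \cdot S_F$. A small Python sketch (with a hypothetical helper name) reproduces the arithmetic:

```python
def partition_sizes(total_size: float, u: float, r: float) -> dict:
    """Split a flash cache of total_size into read cache, write cache, and OPS."""
    caching_space = u * total_size
    return {
        "read": r * caching_space,
        "write": (1 - r) * caching_space,
        "ops": (1 - u) * total_size,
    }


sizes = partition_sizes(total_size=4.0, u=0.64, r=0.25)  # sizes in GB
print({name: round(size, 2) for name, size in sizes.items()})
```

As in the text, these values ignore the single empty block reserved for GC.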


5 Implementation Issues of Flash Cache Layer


In this section, we describe some implementation issues
related to our flash memory cache management
scheme, which we refer to as OP-FCL (Optimal Partitioning
of Flash Cache Layer). Fig. 5(a) shows the overall
structure of the hybrid storage system that we envision.
The storage system has a HDD serving as main
storage and an SSD, which we also refer to as the flash
cache layer (FCL), that is used as a non-volatile cache
keeping recently read/written data as previous studies
have done [4, 11, 15]. As is common on SSDs, it has
DRAM for buffering I/O data and storing data struc- 
tures used by the SSD. The space at the flash cache layer 
is divided into three regions: the read cache area, the 
write cache area, and the over-provisioned space (OPS) 
as shown in Fig. 5(b). OP-FCL measures the read and 
write cache hit and miss rates and the I/O rates. Then, 
it periodically calculates the optimal size of these cache 
spaces and progressively adjusts their sizes during the 
next period. 


To accurately simulate the operations and measure the
costs of the hybrid storage system, we use DiskSim [2]
to emulate the HDD and DiskSim's MSR SSD extension [1]
to emulate the SSD. Specifically, the simulator
mimics the behavior of Maxtor's Atlas 10K IV disk,
whose average read and write latencies are 4.4ms and 4.9ms,
respectively, with a transfer speed of 72MB/s. Also, the
SSD simulator emulates SLC NAND flash memory chip
operations: it takes 25µs to read a page, 200µs to
write a page, 1.5ms to erase a block, and 100µs to transfer
data to/from a page of flash memory through the bus.
The page and block unit sizes are 4KB and 256KB,
respectively, and the flash cache layer manages data in 4KB
units.


In the simulator, we modified the SSD management 
modules and implemented additional features as well as 
the OP-FCL. OP-FCL consists of several components, 
namely, the Page Replacer, Sequential I/O Detector, 
Workload Tracker, Partition Resizer, and Mapping Man- 
ager. 


The Page Replacer has two LRU lists, one each for 
the read and write caches, and maintains LRU ordering 
of data in the caches. In steady state when the cache is 
full, the LRU data is evicted from the cache to accom- 
modate newly arriving data. For the read cache, cache 
eviction simply means that the data is invalidated, while 
for the write cache, it means that data must be destaged,
incurring a flash cache layer read and a disk write
operation. In the actual implementation, the Page Replacer
destages several dirty data items together to minimize seek
distance by applying the elevator disk scheduling
algorithm. However, we do not consider group destaging in
our cost model as it has only minimal overall impact.
This is because the number of data items destaged as a group
is relatively small compared to the total number of data
items in the write cache.


Figure 5: OP-FCL architecture

Previous studies have taken notice of the impact of
sequential references on cache pollution and thus, have
tried to detect and treat them separately [13]. The
Sequential I/O Detector monitors the reference pattern and
detects sequential references. In our current implementation,
consecutive I/O requests greater than 128KB are
regarded as sequential references, and those requests
bypass the flash cache layer and are sent directly to disk to
avoid cache pollution.
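
One possible reading of this detection rule is sketched below in Python: a run of requests is tracked while each request starts where the previous one ended, and once the accumulated run exceeds 128KB the requests are flagged for bypass. The class name, method name, and this exact interpretation of "consecutive" are assumptions made for illustration; the actual detector may differ.

```python
class SequentialDetector:
    """Flags requests that extend a contiguous run longer than a threshold."""

    def __init__(self, threshold_bytes: int = 128 * 1024):
        self.threshold = threshold_bytes
        self.next_offset = None   # byte offset where a contiguous request would start
        self.run_bytes = 0        # length of the current contiguous run

    def is_sequential(self, offset: int, length: int) -> bool:
        if offset == self.next_offset:
            self.run_bytes += length      # request continues the current run
        else:
            self.run_bytes = length       # request starts a new run
        self.next_offset = offset + length
        return self.run_bytes > self.threshold


detector = SequentialDetector()
KB = 1024
for offset in (0, 64 * KB, 128 * KB, 10 * 1024 * KB):
    length = 64 * KB if offset < 1024 * KB else 4 * KB
    target = "disk (bypass)" if detector.is_sequential(offset, length) else "flash cache"
    print(f"offset {offset // KB:>6} KB -> {target}")
```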

Besides the Page Replacer that manages the cached 
data, the Workload Tracker maintains LRU lists of ghost 
buffers to simultaneously measure hit rates of various 
cache sizes, following the method proposed by Patter- 
son et al. [25]. Ghost buffers maintain only logical ad- 
dresses, not the actual data and, thus, memory overhead 
is minimal requiring roughly 1% of the total flash mem- 
ory cache. Part of the ghost buffer represents data in 
cache and others represent data that have already been 
evicted from the cache. Keeping information of evicted 
data in the ghost buffer makes it possible to measure the 
hit rate of a cache larger than the actual cache size. To 
simulate various cache sizes simultaneously, we use N- 
segmented ghost buffers. In other words, we divide the 
ghost buffer into N-segments corresponding to N cache 
sizes and thus, hit rates of N cache sizes can be obtained 
by combining the hit rates of the segments. From the hit 
rates of N cache sizes, we obtain the read/write hit rate 
curves by interpolating the missing cache sizes. 
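
The N-segmented ghost buffer idea can be illustrated with a small Python sketch. It keeps only addresses in LRU order and, on each hit, records which segment the address was found in; the hit rate of a cache spanning the first k segments is then the cumulative hit count of those segments divided by the number of accesses. The class and method names are hypothetical, and the list-based LRU is chosen for brevity rather than efficiency.

```python
from itertools import accumulate


class SegmentedGhostBuffer:
    """LRU list of addresses split into equally sized segments."""

    def __init__(self, num_segments: int, segment_size: int):
        self.num_segments = num_segments
        self.segment_size = segment_size
        self.lru = []                          # most recently used address first
        self.segment_hits = [0] * num_segments
        self.accesses = 0

    def access(self, address) -> None:
        self.accesses += 1
        if address in self.lru:
            depth = self.lru.index(address)    # LRU stack depth of the address
            self.segment_hits[depth // self.segment_size] += 1
            self.lru.remove(address)
        self.lru.insert(0, address)
        # Drop addresses that fall beyond the largest simulated cache size.
        del self.lru[self.num_segments * self.segment_size:]

    def hit_rates(self) -> list:
        """Hit rate for cache sizes of 1, 2, ..., N segments."""
        if self.accesses == 0:
            return [0.0] * self.num_segments
        return [hits / self.accesses for hits in accumulate(self.segment_hits)]


ghost = SegmentedGhostBuffer(num_segments=2, segment_size=2)
for address in (1, 2, 3, 1, 3):
    ghost.access(address)
print(ghost.hit_rates())
```

In this toy trace, the second access to address 1 is found at depth 2, so it counts as a hit only for the larger (two-segment) cache, which is how a single pass yields hit rates for several cache sizes at once.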

Note that though we use the LRU cache replacement
policy for this study, our model can accommodate any
replacement policy so long as it can be implemented
in the flash cache and the ghost buffer management layers.
Different replacement policies will generate different
read/write hit rate curves and, in the end, affect
the results. However, a replacement policy only affects
the read/write hit rate curves, and thus, our overall cost
model is not affected.

These hit rate curves are obtained per period. In the
current implementation, a period is the logical time to
process 65536 ($2^{16}$) read and write requests. When the
period ends, new hit rate curves are generated, while a
new period starts. Then, with the hit rate curves generated
by the Workload Tracker in the previous period,
the Partition Resizer gradually adjusts the sizes of the
three spaces, that is, the read and write cache space and
the OPS for the next period. To make the adjustment,
the Partition Resizer determines the optimal u and r as
described in Section 4, and those optimal values in turn
decide the optimal size of the three spaces.


Algorithm 1 Optimal Partitioning Algorithm

 1: procedure OPTIMAL_PARTITIONING
 2:     step ← segment_size / total_cache_size
 3:     INIT_PARMS(op_cost, op_u, op_r)
 4:     for u ← step; u < 1.0; u ← u + step do
 5:         for r ← 0.0; r ≤ 1.0; r ← r + step do
 6:             cur_cost ← C_HY(u, r)        ▷ Call Eq. 6
 7:             if cur_cost < op_cost then
 8:                 op_cost ← cur_cost
 9:                 op_u ← u, op_r ← r
10:             end if
11:         end for
12:     end for
13:     ADJUST_CACHE_SIZE(op_u, op_r)
14: end procedure


To obtain the optimal u and r, we devise an iterative
algorithm presented in Algorithm 1. Starting from u=step,
the outer loop iterates the inner loop increasing u in 'step'
increments while u is less than 1.0. The two extreme
configurations that we do not consider are where OPS is
0% and 100%. These are unrealistic configurations as
OPS must be greater than 0% to perform garbage collection,
while OPS being 100% would mean that there is no
space to cache data. The inner loop starting from r=0
iterates, calculating the access cost of the hybrid storage
system as derived in Equation 6, while increasing r
in 'step' increments until r reaches 1.0. The 'step' value
can be calculated as the segment
size divided by the total cache size, as shown in the second
line of Algorithm 1. The nested loop iterates N x M
times to calculate the costs, where N is the outer loop
count, 1/step - 1, and M is the inner loop count, 1/step + 1.
A single cost calculation consists of 10 ADD, 4 SUB, 11
MUL, and 4 DIV operations. Finer 'step' values may be
used resulting in finer u and r values, but with increased
cost calculation overhead. However, the computational
overhead for executing this algorithm is quite small because
it runs once every period and the calculations are just
simple arithmetic operations.
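
The grid search of Algorithm 1 can be written as a short Python sketch. Here `cost_fn` is a hypothetical callable standing in for Equation 6 evaluated with the current hit rate curves, and the resizing step (ADJUST_CACHE_SIZE) is left out. Integer loop counters are used instead of repeatedly adding 'step', a choice made here to avoid floating point drift.

```python
import math


def optimal_partitioning(cost_fn, segment_size, total_cache_size):
    """Return (u, r, cost) minimizing cost_fn over the grid of Algorithm 1."""
    step = segment_size / total_cache_size
    steps = round(1 / step)
    best_cost, best_u, best_r = math.inf, None, None
    for i in range(1, steps):             # u = step, 2*step, ..., 1 - step
        u = i * step
        for j in range(steps + 1):        # r = 0, step, ..., 1
            r = j * step
            cost = cost_fn(u, r)
            if cost < best_cost:
                best_cost, best_u, best_r = cost, u, r
    return best_u, best_r, best_cost


# Toy cost surface with a known minimum at u = 0.5, r = 0.25.
calls = []


def toy_cost(u, r):
    calls.append((u, r))
    return (u - 0.5) ** 2 + (r - 0.25) ** 2


print(optimal_partitioning(toy_cost, segment_size=1, total_cache_size=4))
print(len(calls), "cost evaluations")
```

With four segments the sketch performs (1/step - 1) x (1/step + 1) = 3 x 5 = 15 cost evaluations, matching the N x M count given in the text.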


Once the optimal u and r and, in turn, the optimal sizes 
are determined, the Partition Resizer starts to progres- 
sively adjust the sizes of the three spaces. To increase 
OPS size, it gradually evicts data in the read or write 
caches. To increase cache space, that is, decrease OPS, 
GC is performed to produce empty blocks. These empty
blocks are then used by the read and/or write caches.

The key role of our Mapping Manager is translating 
the logical address to a physical location in the flash 
cache layer. For this purpose, it maintains a mapping ta- 
ble that keeps the translation information. In our imple- 
mentation, we keep the mapping information at the last 
page of each block. As we consider flash memory blocks 
with 64 pages, the overhead is roughly 1.6%. Moreover, 
we implement a crash recovery mechanism similar to 
that of LFS [27]. If a power failure occurs, it searches 
for the most up-to-date checkpoint and goes through a 
recovery procedure to return to the checkpoint state. 


6 Performance Evaluation 


In this section, we evaluate OP-FCL. For comparison, we 
also implement two other schemes. The first is the Fixed
Partition-Flash Cache Layer (FP-FCL) scheme. This is
the simplest scheme where the read and write caches are
not distinguished, but unified as a single cache. The OPS
is available with a fixed size. This scheme is used to 
mimic a typical SSD of today that may serve as a cache 
in a hybrid storage system. Normally, the SSD would not 
distinguish read and write spaces and it would have some 
OPS, whose size would be unknown. We evaluate this 
scheme as we vary the percentage of the caching space 
set aside for the (unified) cache. The best of these results 
will represent the most optimistic situation in real life 
deployment. 

The other scheme is the Read and Write-Flash Cache 
Layer (RW-FCL) scheme. This scheme is in line with the 
observation made by Kgil et al. [11] in that read and write 
caches are distinguished. This scheme, however, goes a 
step further in that, while the sum of the two cache sizes 
remains constant, the sizes of the two are dynamically 
adjusted for best performance according to the cost 
models described in Section 4. For this scheme, the OPS 
size is also fixed, as the total read and write cache 
size is fixed. We evaluate this scheme as we vary the 
percentage of the caching space set aside for the combined 
read and write cache. Initially, all three schemes start with 
an empty data cache. For OP-FCL, the initial OPS size 
is set to 5% of the total flash memory size. 

The experiments are conducted using two sets of 
traces. We categorize them based on the size of requests. 
The first set, ‘Small Scale’, consists of workloads that request 
less than 100GBs of total data. The other set, ‘Large 
Scale’, consists of workloads with over 100GBs of data requests. 
Details of the characteristics of these workloads are in 
Table 1. 

The first two subsections discuss the performance 
aspects of the two classes of workloads. Then, in the next 
subsection, we present the effect of OP-FCL on the 
lifetime of SSDs. In the final subsection, we present a 
sensitivity analysis of two parameters that need to be 
determined for our model. 

Table 1: Characteristics of I/O workload traces 

Figure 6: Mean response time of hybrid storage 


6.1 Small scale workloads 


The experimental setting is as given in Fig. 5 described 
in Section 5. The specific configuration of the HDD and 
SSD used in these experiments is shown in Table 2, denoted 
as ‘Config. 1’. All other parameters not explicitly 
mentioned are set to default values. We assume a single 
SSD is employed as the flash memory cache and a single 
HDD as the main storage. This configuration is similar 
to that of a real hybrid drive [30]. 


For small scale workloads we use three traces, namely, 
Financial, Home, and Search Engine that have been used 
in numerous previous studies [7, 11, 15, 16, 17]. The Fi- 
nancial trace is a random write intensive I/O workload 
obtained from an OLTP application running at a financial 
institution [33]. The Home trace is a random write 
intensive I/O workload obtained from an NFS server that 
keeps home directories of researchers whose activities 
are development, testing, and plotting [6]. The Search 
Engine trace is a random read intensive I/O workload ob- 
tained from a web search engine [33]. The Search Engine 
trace is unique in that 99% of the requests are reads while 
only 1% are writes. 


Fig. 6 shows the results of the cache partitioning 
schemes, where the measure is the response time of the 
hybrid storage system. The x-axis here denotes the ratio 
of caching space (unified or read and write combined) for 
the FP-FCL and RW-FCL schemes. For example, 60 in 
the x-axis means that 60% of the flash memory capacity 
is used as caching space and 40% is used as OPS. The 
y-axis denotes the average response time of the read and 
write requests. In the figure, the response times of FP- 
FCL and RW-FCL schemes vary according to the ratio 
of the caching space. In contrast, the response time of 
OP-FCL is drawn as a horizontal line because it reports 
only one response time regardless of the ratio of caching 
space, as it dynamically adjusts the three spaces according 
to the workload. 

Figure 7: Cumulative garbage collection time 

Figure 8: Hit rate 

Let us first compare FP-FCL and RW-FCL in Fig. 6. In 
cases of the Financial and Home traces, we see that RW- 
FCL provides lower response time than FP-FCL. This is 
because RW-FCL is taking into account the different read 
and write costs in the flash memory cache layer. This re- 
sult is in accord with previous studies that considered dif- 
ferent read and write costs of flash memory [11]. How- 
ever, in the case of the Search Engine trace, discriminat- 
ing read and write requests has no effect because 99% of 
the requests are reads. Naturally, FP-FCL and RW-FCL 
show almost identical response times. 

Now let us turn our focus to the relationship between 
the size of caching space (or OPS size) and the response 
time. In Fig. 6(a) and (b), we see that the response time 
decreases as the caching space increases (or OPS decreases) 
until it reaches the minimal point, and then increases 
beyond this point. Specifically, for both FP-FCL and 
RW-FCL, the minimal point is at 60% for the Financial 
trace and at 50% for the Home trace. In 
contrast, for the Search Engine trace, response time de- 
creases continuously as the cache size increases and the 
optimal point is at 95%. The reason behind this is that 
the trace is dominated by read requests with rare modi- 
fications to the data. Thus, the optimal configuration for 
this trace is to keep as large a read cache as possible with 
only a small amount of OPS and write cache. 

For the FP-FCL and RW-FCL schemes, the response 
time at the optimal point can be regarded as the off-line 
optimal value because it is obtained after exploring all 
possible configurations of the scheme. Let us now com- 
pare the response time of OP-FCL and the off-line opti- 
mal results of RW-FCL. In all traces, OP-FCL has almost 
the same response time as the off-line optimal value of 
RW-FCL. This shows that the cost model based dynamic 
adaptation technique of OP-FCL is efficient in determin- 
ing the optimal OPS and the read and write cache sizes. 

We now discuss the trade-off between garbage collec- 
tion (GC) cost and the hit rate at the flash cache layer. 
Figs. 7 and 8 depict these results. In Fig. 7, we see that 
for all traces, GC cost increases, that is, performance 
degrades, continuously as caching space increases. The hit 
rate, on the other hand, increases, thus improving perfor- 
mance as caching space increases for all the traces as we 
can see in Fig. 8. For clear comparisons, we report the 
sum of the read and write hit rates for RW-FCL and OP- 
FCL. Note that both schemes measure read and write hit 
rates separately. 

These results show the existence of two contradicting 
effects as caching space is increased, that is, increasing 
cache hit rate, which is a positive effect, and increasing 
GC cost, which is a negative effect. The interaction of 
these two contradicting effects leads to an optimal point 
where the overall access cost of the hybrid storage sys- 
tem becomes minimal. 

To investigate how well OP-FCL adjusts the caching 
space and OPS sizes, we continuously monitor their sizes 
as the experiments are conducted. Fig. 9 shows these 
results. In the figure, the x-axis denotes logical time that 
elapses upon each request and the y-axis denotes the 
total (read + write) caching space size and the read cache 
sizes. Notice that out of the 4GB of flash memory cache 
space, only 2 to 2.5GBs are being used for the Financial 
trace and less than half is used for the Home trace. Even 
natural consequence as reads are dominant, garbage 
collection rarely happens. Also note that it is taking some 
time for the system to stabilize to the optimal allocation 
setting. 

Figure 9: Dynamic size adjustment of read and write caches and OPS 


6.2 Large scale workloads 

Figure 10: Response time of hybrid storage 

Figure 11: Cumulative garbage collection time 


Our experimental setting for large scale workloads is as 
shown in Fig. 5 with the configuration summarized as 
‘Config. 2’ in Table 2. In this configuration the SSD 
is 16GBs employing four packages of flash memory and 
the HDD consists of three 10K RPM drives. 

To test our scheme for large scale enterprise workloads, 
we use the Exchange and MSN traces that have 
been used in previous studies [15, 21, 22]. The Exchange 
trace is a random I/O workload obtained from the 
Microsoft employee e-mail server [22]. This trace is 
composed of 9 volumes of which we select and use traces 
of volumes 2, 4, and 8, and each volume is assigned to 
each HDD. The MSN trace is extracted from 4 RAID-10 
volumes on an MSN storage back-end file store [22]. We 
choose and use the traces of volumes 0, 1, and 4, each 
assigned to one HDD. The characteristics of the two traces 
are summarized in Table 1. 

Figure 12: Hit rate 

Figure 13: Dynamic size adjustment of read and write caches and OPS 

Figure 14: Average erase count of flash memory blocks 

Fig. 10, which depicts the response time for the two 
large scale workloads, shows trends similar to those we 
observed with the small scale workloads, in that, as caching 
space increases, response time decreases to a minimal 
point, and then increases again. The response time of 
OP-FCL, which is shown as a horizontal line in the fig- 
ure, is close to the smallest response times of FP-FCL 
and RW-FCL. From these results, we confirm again that 
a trade-off between GC cost and hit rate exists at the flash 
cache layer. 


Specifically, for the Exchange trace shown in 
Fig. 10(a), the minimal point for FP-FCL is at 70%, 
while it is at 80% for RW-FCL. The reason behind this 
difference can be found in Fig. 11 and Fig. 12. Fig. 12(a) 
shows that RW-FCL has a higher hit rate than FP-FCL 
at cache size 80%. On the other hand, Fig. 11(a) shows 
that for cache sizes of 70% to 80% the GC cost increase is 
steeper for FP-FCL than for RW-FCL. These results imply 
that, for RW-FCL, the positive effect of caching more 
data is greater than the negative effect of increased GC 
cost at 80% cache size, and vice versa for FP-FCL. These 
differences in positive and negative effect relations for 
FP-FCL and RW-FCL result in different minimal points. 


From the results of the MSN trace shown in 
Fig. 10(b), we observe that FP-FCL and RW-FCL have 
almost identical response times. This is because they 
have almost the same hit rate curves, which means that 
discriminating read and write requests has no perfor- 
mance benefit for the MSN trace. The minimal points 
for FP-FCL and RW-FCL are at cache size 80% for this 
trace. 


As with the small scale workloads, Fig. 13 shows how 
OP-FCL adjusts the cache and OPS sizes according to 
the reference pattern for the large scale workloads. Ini- 
tially, the cache size starts to increase as we start with 
an empty cache. Then, we see that the scheme stabilizes 
with OP-FCL dynamically adjusting the caching space 
and OPS sizes to their optimal values. 


6.3 Effect on lifetime of SSDs 


Now let us turn our attention to the effect of OP-FCL 
on the lifetime of SSDs. Generally, block erase count, 
which is affected by the wear-levelling technique used by 
the SSDs, directly corresponds to SSD lifetime. There- 
fore, we measure the average erase counts of flash mem- 
ory blocks for all the workloads, and the results are 
shown in Fig. 14. With the exception of the Search En- 
gine, we see that, for FP-FCL and RW-FCL, the aver- 
age erase count is low when caching space is small. As 
caching space becomes larger, the average erase count 
increases only slightly until the caching space reaches 
around 70%. Beyond that point, the erase count increases 
sharply as OPS size becomes small and GC cost rises. In 
contrast, OP-FCL has a low average erase count drawn 
as a horizontal line in Fig. 14. 

In contrast to the other traces, the average erase count 
for the Search Engine trace is rather unique. First, the 
overall average erase count is noticeably lower than that 
of the other traces. Also, instead of a sharp increase ob- 
served for the other traces, we first see a noticeable drop 
as the cache size approaches 80%, before a sharp increase. 
The reason behind this is that 99% of the Search 
Engine trace are read requests and the footprint is so 
huge that the cache hit rate continuously increases almost 
linearly with larger caches as shown in Fig. 8(c). 
This continuous increase in hit rate continuously reduces 
new writes, resulting in reduced garbage collection and, 
eventually, in fewer block erases. Beyond the 80% 
point, block erases increase because GC cost increases 
sharply as the OPS becomes smaller. 

Figure 15: Sensitivity analysis of sequential unit size and 
period length on OP-FCL performance 


6.4 Sensitivity analysis 


In this subsection, we present the effect of the choice 
of the sequential unit size and the length of the period on 
the performance of OP-FCL. The results for all the workloads 
are reported relative to the parameter settings used 
for all the results presented in the previous subsections: 
the sequential unit size of 128 and period length of 2!°. 

Recall that the sequential unit size determines the con- 
secutive request size that the Sequential I/O Detector re- 
gards as being sequential, and that these requests are sent 
directly to the HDD. Fig. 15(a) shows the effect of the 
sequential unit size. When the sequential unit size is 4 KB, 
OP-FCL performs very poorly. This is because too many 
requests are considered sequential and are 
sent directly to the HDD. However, when the sequential 
unit size is between 16 KB and 512 KB, OP-FCL shows 
similar performance, indicating that performance is relatively 
insensitive to the choice of this parameter. 
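
To make the role of the sequential unit size concrete, the sketch below shows one way such a detector could be written. It is a hypothetical illustration, not the Sequential I/O Detector used in the experiments: the class name, the `route` method, and the rule that a run of address-consecutive requests is treated as sequential once its accumulated size reaches the unit size are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class SequentialDetector:
    """Hypothetical detector: routes long consecutive runs to the HDD."""

    unit_size_kb: int        # the "sequential unit size" threshold, in KB
    _next_addr_kb: int = -1  # address that would continue the current run
    _run_kb: int = 0         # accumulated size of the current run, in KB

    def route(self, addr_kb: int, size_kb: int) -> str:
        """Return "hdd" for requests regarded as sequential, else "cache"."""
        if addr_kb == self._next_addr_kb:
            self._run_kb += size_kb  # request continues the current run
        else:
            self._run_kb = size_kb   # request starts a new run
        self._next_addr_kb = addr_kb + size_kb
        return "hdd" if self._run_kb >= self.unit_size_kb else "cache"


# 32 back-to-back 4 KB requests against a 128 KB threshold.
detector = SequentialDetector(unit_size_kb=128)
routes = [detector.route(addr_kb=4 * i, size_kb=4) for i in range(32)]
```

Under this rule a 4 KB threshold classifies every request of 4 KB or more as sequential, so essentially all traffic bypasses the flash cache, which is consistent with the poor performance reported for that setting.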

Fig. 15(b) shows the performance of OP-FCL as the 
length of the period is varied from 2!* to $2^{20}$ requests. 


Overall, the performance is stable. The Home trace per- 
formance deteriorates somewhat for periods of 2!* and 
below, with worse performance as the period shortens. 
The reason behind this is that the workload changes fre- 
quently as observed in Fig. 9. As a result, by the time 
OP-FCL adapts to the results of the previous period, the 
new adjustment becomes stale, resulting in performance 
reduction. We also see that performance is relatively 
consistent and best for periods between 2* to 2!°. For 
periods beyond $2^{18}$, OP-FCL performance deteriorates 
slightly. As the period increases to $2^{20}$, the performance of 
the Exchange and MSN traces starts to degrade. This is 
because the change in the workload spans a relatively 
large range compared to those of small scale workloads 
as shown in Fig. 13. For this reason, OP-FCL of longer 
periods is not dynamic enough to reflect these workload 
changes effectively. Overall though, we find that for a 
relatively broad range of periods performance is consis- 
tent. 


7 Conclusions 


NAND flash memory based SSDs are being used as non- 
volatile caches in hybrid storage solutions. In flash based 
storage systems, there exists a trade-off between increas- 
ing the benefits of caching data and increasing the ben- 
efit of reducing the update cost as garbage collection 
cost is involved. To increase the former, caching space, 
which is cache space that holds normal data, must be 
increased, while to increase the latter, over-provisioned 
space (OPS) must be increased. In this paper, we showed 
that balancing the caching space and OPS sizes has a sig- 
nificant impact on the performance of hybrid storage so- 
lutions. For this balancing act, we derived cost models 
to determine the optimal caching space and OPS sizes, 
and proposed a scheme that dynamically adjusts sizes of 
these spaces. Through experiments we show that our dynamic 
scheme performs comparably to the off-line optimal 
fixed partitioning scheme. We also show that the 
lifetime of SSDs may be extended considerably as the 
erase count at SSDs may be reduced. 

Many studies on non-volatile cache have focussed on 
cache replacement and destaging policies. As a miss at 
the flash memory cache leads to HDD access, it is criti- 
cal that misses be reduced. When misses do occur at the 
write cache, intelligent destaging should help ameliorate 
miss effects. Hence, we are currently focusing our ef- 
forts on developing better cache replacement and destag- 
ing policies, and in combining these policies with our 
cache partitioning scheme. Another direction of research 
that we are pursuing is managing the flash memory cache 
layer to tune SSDs to trade off performance against 
lifetime. 


Abstract 


NAND flash-based solid-state drives (SSDs) are increas- 
ingly popular in enterprise server systems because of 
their advantages over hard disk drives such as higher 
performance and lower power consumption. However, 
the limited and unpredictable lifetime of SSDs 
remains a serious obstacle to wider adoption of 
SSDs in enterprise systems. In this paper, we pro- 
pose a novel recovery-aware dynamic throttling tech- 
nique, called READY, which guarantees the SSD life- 
time required by the enterprise market while exploiting 
the self-recovery effect of floating-gate transistors. Un- 
like a static throttling technique, the proposed technique 
makes throttling decisions dynamically based on the pre- 
dicted future write demand of a workload so that the 
required SSD lifetime can be guaranteed with less per- 
formance degradation. The proposed READY technique 
also considers the self-recovery effect of floating-gate 
transistors, which improves the endurance of SSDs, making 
it possible to guarantee the required lifetime with less write 
throttling. Our experimental results show that the proposed 
READY technique can improve write performance 
by 4.4x with less variation in write time over the existing 
static throttling technique while guaranteeing the 
required SSD lifetime. 


1 Introduction 


NAND flash memory has been widely used in mobile 
embedded systems like mobile phones, MP3 players, and 
laptop computers because of its low-power consump- 
tion, high mobility, and high performance. Recently, 
as the price-per-byte of NAND flash memory is falling, 
NAND flash-based solid-state drives (SSDs) are increas- 
ingly popular in enterprise servers as well, replacing 
hard disk drives. However, the poor write endurance of 
NAND flash memory is still regarded as a main barrier 
to wide adoption of flash-based SSDs in the enterprise 
market. In order for SSDs to be broadly adopted in the 
enterprise environment, two key problems concerning SSD 
lifetime need to be addressed properly. 


The first problem is that the endurance of flash de- 
vices is rapidly decreasing. The endurance of flash-based 
SSDs is directly related to the number of program/erase 
(P/E) cycles allowed to memory cells, which are made 
from floating-gate transistors. Due to the charge trapping 
characteristic of a floating-gate transistor [1, 2], NAND 
flash memory is gradually impaired as the number of 
P/E cycles increases and becomes unreliable beyond a 
maximum number of P/E cycles. As the semiconductor 
process is scaled down and multi-level cell (MLC) 
technology is adopted, the endurance of a floating-gate 
transistor is significantly degraded. For example, the maximum 
number of P/E cycles of single-level cell (SLC) flash memory 
fabricated in a 70 nm process is about 100K P/E cycles. 
For 2-bit MLC flash memory fabricated in the 2x nm pro- 
cess, the maximum number of P/E cycles decreases to 3K 
P/E cycles [3, 4, 5] while, for 3-bit MLC flash memory, 
this number is only a few hundred cycles [6]. 


The second problem is the unpredictable lifetime of 
flash devices. Since the endurance of SSDs is dependent 
upon the number of P/E cycles, the SSD lifetime is de- 
termined by extra data written by garbage collection and 
wear-leveling as well as by the number of bytes written 
by applications. This means that, unlike HDDs, the SSD 
lifetime is a function of a workload. Therefore, even if 
the endurance of SSDs seems sufficient, the lifetime of 
SSDs strongly depends on the write intensiveness of the 
workload. For example, SSDs may achieve the required 
lifetime if a small number of write requests are required 
from applications. On the other hand, the same SSDs 
will fail much earlier if they are used in a write inten- 
sive environment. In particular, as cost-effective MLC- 
based SSDs are becoming popular in the enterprise mar- 
ket where write requests are intensive [7, 8], it is a chal- 
lenge to guarantee a minimum SSD lifetime of 3-5 years, 
which enterprise customers often require [9]. 

In this paper, we overcome these technical difficulties 
by proposing a recovery-aware dynamic throttling technique, 
called READY. The basic concept of READY is to 
throttle write performance by adding throttling delays to 
write requests, so as to guarantee the required SSD lifetime. 
With dynamic throttling, the IOPS and bandwidth 
of SSDs are reduced to a certain extent. From the application 
perspective, applications’ execution times are increased 
as if they ran on top of a slower device. As a 
result, the amount of write traffic sent to a storage device 
is reduced, lessening the wearing rate of SSDs. 

The dynamic throttling technique inevitably reduces 
the overall write performance. In order to mitigate per- 
formance degradation, we carefully determine throttling 
delay by predicting future write demands and distribute 
the predicted delay over the entire SSD lifetime so that 
better write response time can be obtained with less 
variation in the response time. In addition, the pro- 
posed dynamic throttling technique takes into account 
the self-recovery characteristic of a floating-gate transis- 
tor. Because of the physical characteristics of NAND 
flash memory, the damage caused by repetitive P/E cy- 
cles can be partially recovered during the idle period 
between two consecutive P/E cycles, improving the en- 
durance of a floating-gate transistor [1, 2, 10, 11, 12]. 
By considering the endurance improvement by the self- 
recovery effect, the proposed READY technique can be 
more optimistic on the total number of data written, 
thus employing a smaller throttling delay. Our evaluation 
results show that the proposed throttling technique 
improves the average write response time by 4.4x with 
less variation over an existing static throttling technique 
while guaranteeing the SSD lifetime. 

This paper is organized as follows. In Section 
2, we briefly explain the endurance characteristics of 
NAND flash memory. Section 3 describes the proposed 
recovery-aware dynamic throttling technique in detail. In 
Section 4, we evaluate the effectiveness of the proposed 
recovery-aware dynamic throttling technique using en- 
terprise benchmarks. Section 5 describes related work 
on improving the SSD endurance. Finally, Section 6 concludes 
with a summary and directions for future work. 


2 Endurance Characteristics of Flash Memory 


In NAND flash memory, program/erase (P/E) opera- 
tions inevitably cause damage to floating-gate transis- 
tors, reducing the overall endurance of memory cells. 
At the device level, memory cells are gradually worn 
out as charges get trapped in the interface and oxide 
layers of a floating-gate transistor during P/E cycles. 
This charge trapping increases the threshold voltage of 
a floating-gate, which indicates a logical bit value of a 
cell, and the cell becomes unreliable when the threshold 
voltage is higher than a certain voltage margin, e.g., 
0.65V for MLC flash memory [1]. According to [1, 10], 
the increase, $\delta V_{trap}$, in a threshold voltage because of 
charge trapping approximately scales with P/E cycles in 
a power-law fashion as follows: 


$$\delta V_{trap} = A_{it} \cdot N^{0.62} + B_{ot} \cdot N^{0.30}, \qquad (1)$$


where $N$ is the number of P/E cycles. $A_{it}$ and $B_{ot}$ are 
constants set to $2.97 \times 10^{-3}$ and $2.0 \times 10^{-2}$, 
respectively. Usually, NAND flash memory vendors do 
not reveal important parameters for their recent products. 
Thus, in this work, $A_{it}$ and $B_{ot}$ for 20 nm MLC flash 
memory are obtained by scaling up values for 90 nm 
MLC flash memory, which are available to the public, 
so that the number of P/E cycles approximately matches 
3K at the point where $\delta V_{trap}$ is 0.65V. 

Figure 1: The achievable number of P/E cycles depending 
on the different idle times. 
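
The calibration described above can be checked numerically. The sketch below assumes the power-law form of Eq. (1) with exponents 0.62 and 0.30 and with constants 2.97e-3 and 2.0e-2; both the exponents and the orders of magnitude of the constants are assumptions of this sketch, and the function names are illustrative. It searches for the largest number of P/E cycles that keeps the trapping-induced voltage shift within the 0.65V margin.

```python
A_IT = 2.97e-3   # interface-trap coefficient (assumed order of magnitude)
B_OT = 2.0e-2    # oxide-trap coefficient (assumed order of magnitude)
V_MARGIN = 0.65  # voltage margin for MLC flash memory, in volts


def delta_v_trap(n_cycles: float) -> float:
    """Threshold-voltage increase after n_cycles P/E cycles (assumed Eq. 1)."""
    return A_IT * n_cycles ** 0.62 + B_OT * n_cycles ** 0.30


def max_pe_cycles(margin: float = V_MARGIN) -> int:
    """Largest cycle count whose voltage shift stays within the margin.

    delta_v_trap is monotonically increasing, so a binary search suffices.
    """
    lo, hi = 1, 1_000_000
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if delta_v_trap(mid) <= margin:
            lo = mid
        else:
            hi = mid - 1
    return lo


print(f"delta_v_trap(3000) = {delta_v_trap(3000):.3f} V")
print(f"max P/E cycles within margin: {max_pe_cycles()}")
```

Under these assumptions the shift at 3,000 cycles sits just below 0.65V and the search returns a value slightly above 3,000, in line with the 3K target mentioned in the text.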

A floating-gate transistor also has a self-recovery 
property which heals the damage of a cell by detrapping 
charges captured in the oxide of a cell. This recovery (or 
detrapping) process occurs during the idle time between 
P/E cycles on the same cell, and its effect in general 
increases as the logarithm of the idle time, i.e., detrapping 
$\propto \ln(t)$, where $t$ is the length of the idle time. According 
to [1, 10, 13], the decrease, $\delta V_{detrap}$, in a threshold 
voltage due to charge detrapping can be expressed as follows: 

$$\delta V_{detrap} = C_{r} \cdot \delta V_{trap} \cdot \ln(t / t_{0}), \qquad (2)$$

where $C_{r}$ is a recovery efficiency and set to 5.63 x 10~? 
according to [2]. $t_{0}$ is 1 hour. 

Besides the length of the idle time, there are other fac- 
tors that affect the cell recovery, such as an external tem- 
perature and a programmed threshold voltage. In this 
work, the temperature is assumed to be room temperature 
(25°C) because the external ambient temperature of 
a storage device is typically maintained at room temperature 
[14]. The programmed threshold voltage is not 
taken into account in this study because its effect on the 
damage recovery is relatively negligible. 

According to [1], the effective increase, $\delta V_{th}$, in a 
threshold voltage can be expressed as follows: 


$$\delta V_{th} = \delta V_{trap} - \delta V_{detrap}. \qquad (3)$$


Figure 2: A comparison of three different throttling 
policies: no throttling, static throttling, and recovery-aware 
dynamic throttling. 


Based on Eq. (3), we have plotted the achievable P/E 
cycles of 20 nm MLC flash memory in Figure 1, de- 
pending on the average idle times between two consec- 
utive P/E cycles on the same block. The maximum P/E 
cycles without the recovery effect are 3K. As expected, 
the achievable P/E cycles gradually increase in proportion 
to the length of the idle time. Note that re- 
cent studies that measured the effective P/E cycles of 
real NAND flash parts also reported that the endurance 
of NAND flash memory is higher than P/E cycles in 
datasheets [15, 16]. 

The detrapping phenomenon of a floating-gate transis- 
tor has a positive effect on improving the endurance (or 
increasing P/E cycles) of flash-based SSDs. However, 
most studies use a fixed number of P/E cycles, e.g., 3K 
P/E cycles, provided by flash manufacturers as a primary 
factor to manage the lifetime of SSDs. Therefore, the 
benefit of the damage recovery is not fully utilized. Un- 
like other studies, our recovery-aware dynamic throttling 
technique takes advantage of the self-recovery effect in 
managing the lifetime of SSDs to lessen the performance 
penalty caused by write throttling. 


3 Recovery-Aware Dynamic Throttling 


In this section, we describe the proposed recovery-aware 
dynamic throttling technique. We first introduce the need 
for dynamic throttling in flash-based SSDs using a sim- 
ple motivational example and then explain the main func- 
tions of the proposed throttling technique in detail. 


3.1 Basic Idea 


Figure 2 shows a motivational example of dynamic throttling 
in SSDs. The maximum number, $C_{ssd}$, of bytes 
that can be written to the SSD is proportional to the SSD 
capacity and the number of P/E cycles allowed to each 
block. $C_{ssd}$ is thus easily calculated with the following 
equation: SSD capacity × P/E cycles [17]. For example, 
if the SSD capacity is 128 GB and the number of P/E 
cycles is 3K, $C_{ssd}$ becomes 375 TB. Suppose that a lifetime, 
$T_{ssd}$, to be guaranteed is $1.5768 \times 10^{8}$ seconds, i.e., 
5 years. In the example of Figure 2(a), which does not use 
write throttling, the required lifetime cannot be satisfied 
because the number, $W_{work}$, of bytes written to the SSD 
exceeds $C_{ssd}$ before $T_{ssd}$. 
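
The arithmetic of this example can be reproduced with a few lines. The sketch below assumes binary units (1 GB = 2^30 bytes, 1 TB = 2^40 bytes) and 365-day years, which is what makes the quoted 375 TB and 1.5768 × 10^8 seconds come out exactly; the helper name is illustrative.

```python
GIB = 2 ** 30                      # bytes per GB (binary units assumed)
TIB = 2 ** 40                      # bytes per TB (binary units assumed)
SECONDS_PER_YEAR = 365 * 24 * 3600


def max_writable_bytes(capacity_bytes: int, pe_cycles: int) -> int:
    """Maximum writable bytes: SSD capacity x P/E cycles."""
    return capacity_bytes * pe_cycles


c_ssd = max_writable_bytes(128 * GIB, 3000)  # 128 GB SSD, 3K P/E cycles
t_ssd = 5 * SECONDS_PER_YEAR                 # 5-year lifetime target

print(f"C_ssd = {c_ssd / TIB:.0f} TB")
print(f"T_ssd = {t_ssd:.4e} s")
```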


To ensure the lifetime warranty of the SSD, some 
SSD vendors have recently started to adopt static throttling 
[18, 19], which is shown in Figure 2(b). Static 
throttling guarantees the required lifetime by limiting the 
maximum bandwidth of the SSD to a certain fixed value, 
which is denoted by $B_{static}$. Static throttling determines 
the value of $B_{static}$ based on the assumption of the worst 
case scenario where the number of bytes written per second 
is always larger than $C_{ssd}/T_{ssd}$. In this case, $B_{static}$ 
must be fixed to $C_{ssd}/T_{ssd}$ to ensure the required lifetime. 
The drawback of this approach is that it is likely 
to underutilize the maximum endurance of the SSD, i.e., 
$W_{work} < C_{ssd}$ at $T_{ssd}$, because of its assumption that 
the SSD must provide the $B_{static}$ bandwidth although 
actual workloads may not be that intensive all the time. 
In addition, due to this conservative assumption, the I/O 
response time is degraded with static throttling. 
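
To illustrate why a fixed cap tends to underutilize endurance, the sketch below computes the static bandwidth limit for the running example (375 TB of writable data over 5 years) and applies it to a made-up two-phase workload. The workload shape, the binary units, and all names are assumptions chosen only for illustration.

```python
TIB = 2 ** 40
MIB = 2 ** 20
SECONDS_PER_YEAR = 365 * 24 * 3600

c_ssd = 375 * TIB             # maximum writable bytes from the example
t_ssd = 5 * SECONDS_PER_YEAR  # lifetime to guarantee, in seconds
b_static = c_ssd / t_ssd      # fixed bandwidth cap, in bytes per second


def bytes_written(demands_bps, phase_seconds, cap_bps):
    """Total bytes written when each phase's demand is clipped at the cap."""
    return sum(min(d, cap_bps) * s for d, s in zip(demands_bps, phase_seconds))


# Hypothetical workload: half the lifetime at twice the cap,
# the other half at one fifth of the cap.
demands = [2.0 * b_static, 0.2 * b_static]
phases = [t_ssd / 2, t_ssd / 2]

w_work = bytes_written(demands, phases, b_static)
utilization = w_work / c_ssd

print(f"B_static = {b_static / MIB:.2f} MiB/s")
print(f"endurance used at T_ssd: {utilization:.0%}")
```

The cap works out to roughly 2.5 MiB/s. In this toy workload the busy phase is clipped to the cap while the quiet phase never reaches it, so only about 60% of the available endurance is consumed by the end of the lifetime, even though writes were delayed during the busy phase.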


In order to overcome the limitation of the static throttling 
technique, we propose a recovery-aware dynamic 
throttling technique, READY, which is depicted in Figure 2(c). 
By dynamically throttling write requests according 
to the characteristics of a workload and the remaining 
SSD lifetime, the proposed READY technique 
fully utilizes the given endurance of the SSD up to the 
maximum, while minimizing performance degradation. 
READY is also aware of the endurance improvement by 
the self-recovery characteristic of memory cells. Therefore, 
the data that can be written to the SSD increase by 
$\Delta C_{ssd}$, so the maximum number of writable bytes becomes 
$C^{*}_{ssd}$ (= $C_{ssd} + \Delta C_{ssd}$). This allows us to guarantee 
the required lifetime with less throttling overhead. 

In designing a dynamic throttling policy, we focus on 
two aspects of the design requirements of SSDs. The 
first is to determine a throttling delay as low as possible 
so that $W_{work}$ is close to $C^{*}_{ssd}$ at the time of $T_{ssd}$. If 
$W_{work} = C^{*}_{ssd}$ before $T_{ssd}$, we cannot guarantee the 
required lifetime as shown in Figure 2(a). If $W_{work} < 
C^{*}_{ssd}$ at $T_{ssd}$, write performance significantly deteriorates, 
underutilizing the available endurance of the SSD 
like static throttling as depicted in Figure 2(b). The second 
is to distribute a throttling delay over every write 
request as evenly as possible. Otherwise, response time 
variations can be large, thus lowering the quality of the 
user experience significantly. 

Figure 3: Three main functions of READY. 

To effectively deal with these design issues, the pro- 
posed dynamic throttling technique has been designed 
with three main functions as shown in Figure 3. The 
write demand predictor is in charge of predicting future 
write demands, i.e., the number of bytes that will be 
written to SSDs, by monitoring previous write demands. 
Once the future demand for writes has been predicted, 
the throttling delay estimator determines a throttling de- 
lay by considering both the future write demand and the 
remaining lifetime of SSDs. The epoch-capacity regu- 
lator throttles write performance by applying a throttling 
delay to each write request so that the target SSD lifetime 
will be reached. 


3.2 Estimation of Future Write Demands 


In designing a dynamic throttling policy, it is important 
to estimate the number of bytes that will be written to the 
SSD in advance because the SSD performance must be 
throttled properly if the write demand is expected to be 
too high. The role of the write demand predictor is to 
predict future write demands by monitoring the previous 
write demands of a workload. 

For this purpose, in READY, the entire lifetime,
$T_{ssd}$, of the SSD is divided into epochs. At the
beginning of each epoch, the write demand predictor
estimates the number of bytes to be written during the
epoch based on the number of bytes actually written to the
SSD during the latest epoch. If $w_{i-1}$ bytes have been
written during the $(i-1)$-th epoch, the write demand
predictor predicts that the same number of bytes will be
written to the SSD during the $i$-th epoch. That is,
$w_i \approx w_{i-1}$. This approach is motivated by
previous observations [20] that showed that enterprise
workloads often exhibit cyclic behavior with periods
between several minutes and several days. Although that
work did not address I/O demands in storage devices, it
showed that a strong cyclical behavior is frequently
observed in enterprise applications. This means that if the
length of an epoch is properly decided to include the
cyclic period of a workload, the write demand observed in
the latest epoch can be used as a factor that indicates
future write demands.

Figure 4: Write demand differences with different epoch
lengths for exchange and proxy.

To confirm our hypothesis, we have analyzed the
characteristics of write demands using enterprise traces.
We have compared the difference in write demands between
two consecutive epochs while varying the length of an
epoch from 1 minute to 2 hours. Our analysis has been
performed with several enterprise traces from the
MSR-Cambridge and MS-Production traces [21, 22]. Figure 4
shows our investigation results for the two traces, proxy
and exchange. Here, the X-axis represents the write
demand difference between the predicted write demand
and the actual one in percentage. For example, if the
predicted demand is 100 MB and the actual one is 95 MB,
the write demand difference between them is 5%. The
Y-axis is the cumulative distribution function (CDF) of
the write demand difference of the epochs. The smaller the
difference, the better the accuracy of future write demand
prediction is.
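
As a rough illustration of the metric plotted in Figure 4, the following Python sketch computes the write demand difference of a last-value predictor and the fraction of epochs whose difference stays within a threshold. The function names and the sample demand values are hypothetical, and measuring the difference relative to the predicted demand is an assumption drawn from the 100 MB / 95 MB example above.

```python
def predict_next(history):
    """Last-value predictor: assume w_i equals w_{i-1}."""
    return history[-1]


def demand_difference_pct(predicted, actual):
    """Write demand difference in percent, relative to the prediction (assumed)."""
    return 100.0 * abs(predicted - actual) / predicted


def fraction_within(demands, threshold_pct):
    """Fraction of epochs whose prediction error is at most threshold_pct.

    demands[i] is the number of bytes written in epoch i; the prediction
    for epoch i+1 is simply demands[i].
    """
    diffs = [demand_difference_pct(p, a) for p, a in zip(demands, demands[1:])]
    return sum(d <= threshold_pct for d in diffs) / len(diffs)


if __name__ == "__main__":
    demands_mb = [100, 95, 120, 118]  # hypothetical per-epoch write demands
    print(demand_difference_pct(100, 95))   # 5.0
    print(fraction_within(demands_mb, 10))  # two of three epochs are within 10%
```

Sweeping `threshold_pct` over a range of values yields one point of the CDF per threshold, which is how a curve like those in Figure 4 could be reproduced for a given epoch length.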

As shown in Figure 4, when the length of an epoch 
is decided properly, it is possible to achieve high accu- 
racy in predicting future write demands. In the case of 
exchange, for about 85% of the epochs, the write de- 
mand difference of less than 30% is obtained with the 
epoch length of 30 minutes. For proxy, the epoch 
length of 30 minutes shows the best accuracy in estimat- 
ing future write demands. This result clearly shows that 
the epoch-based write demand prediction is useful to
estimate future write demands. Note that other methods,
such as a moving average, are also applicable for
estimating future write demands.

Since the best epoch length may be different depend- 
ing on a workload and its characteristic (which changes 
with time), the proposed READY technique selects the 
epoch length dynamically adapting to a changing work- 
load. We will discuss this issue in Section 3.5. 


3.3 Calculation of Throttling Delay


The throttling delay estimator adaptively changes a
throttling delay at every epoch by monitoring the write
demand and the remaining SSD lifetime. At the first epoch,
i.e., the 0-th epoch, a throttling delay, $t^{delay}_0$,
is set to 0. Then, at the beginning of each $i$-th epoch,
the delay estimator increases or decreases a throttling
delay, $t^{delay}_i$, based on the expected write demand
and the capacity of each epoch. The expected write demand
indicates the number, $w_i$, of bytes that are expected to
be written during the $i$-th epoch. The capacity of an
epoch is the number, $c_i$, of bytes allowed to be written
during the $i$-th epoch. In this work, $w_i$ is equal to
the number, $w_{i-1}$, of bytes written during the
$(i-1)$-th epoch under the assumption of
$w_i \approx w_{i-1}$. The capacity, $c_i$, of the $i$-th
epoch is determined by dividing the remaining capacity,
$C_r$, of the SSD by the number of remaining epochs. Here,
the remaining capacity, $C_r$, represents the number of
bytes that can be written to the SSD until it becomes
unreliable.

If $w_i$ is equal to $c_i$, we do not need to change the
throttling delay for the $i$-th epoch. Therefore,
$t^{delay}_i$ is the same as $t^{delay}_{i-1}$, which is
the throttling delay of the $(i-1)$-th epoch. However, if
$w_i$ is larger than $c_i$ as shown in Figure 5(a), it is
necessary to increase the throttling delay because the
data to be written to the SSD are expected to be larger
than the capacity allocated to the epoch. The increase,
$\Delta t^{delay}_i$, in the throttling delay can be
expressed as follows:

$$\Delta t^{delay}_i = t_{epoch} \cdot \left( \frac{w_i}{c_i} - 1 \right) / n \quad \text{if } w_i > c_i, \qquad (4)$$

where $n$ is the number of pages allowed to be written
to the SSD during the $i$-th epoch, i.e., $c_i$/page size,
and $t_{epoch}$ is the epoch length. To make the data
written during the $i$-th epoch equal to $c_i$,
$(w_i - c_i)$ of the data must be delayed to the next epoch
as shown in Figure 5(a). The total time required to delay
$(w_i - c_i)$ of the data can be approximated as
$t_{epoch} \cdot (w_i / c_i - 1)$. In our dynamic
throttling policy, a throttling delay is equally
distributed to each page write (refer to Section 3.4), so
$\Delta t^{delay}_i$ can be obtained by dividing the total
throttling delay by $n$.

Finally, a throttling delay, $t^{delay}_i$, for the $i$-th
epoch is determined as follows:
$t^{delay}_i = t^{delay}_{i-1} + \Delta t^{delay}_i$.

If $w_i$ is smaller than $c_i$ as shown in Figure 5(b), it
means that the write requests were not intensive enough
to wear out the device before the required lifetime or they
were throttled too much during the previous epoch.
Therefore, the throttling delay may be reduced so that more
data can be written to the SSD. The decrease,
$\Delta t^{delay}_i$, in the throttling delay can be
expressed as follows:

$$\Delta t^{delay}_i = t_{epoch} \cdot \left( \frac{c_i}{w_i} - 1 \right) / n \quad \text{if } w_i < c_i. \qquad (5)$$

To increase the number of bytes to be written to the
SSD by $(c_i - w_i)$ during the $i$-th epoch, a throttling
delay, $t^{delay}_i$, for the $i$-th epoch is reduced by
$\Delta t^{delay}_i$ as follows:
$t^{delay}_i = t^{delay}_{i-1} - \Delta t^{delay}_i$. In
the case of $t^{delay}_{i-1} < \Delta t^{delay}_i$,
$t^{delay}_i$ is 0. This means that it is not
necessary to apply a throttling delay because the required
lifetime can be guaranteed without write throttling.
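
The per-epoch update described by Eq. (4) and Eq. (5) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, Eq. (5) is read here as the mirror image of Eq. (4) with $c_i / w_i$ in place of $w_i / c_i$, and a zero expected demand is treated as "no throttling needed", which the text does not spell out.

```python
def update_throttling_delay(prev_delay, w, c, t_epoch, page_size):
    """Return the per-page throttling delay (seconds) for the i-th epoch.

    prev_delay: delay applied in the (i-1)-th epoch, seconds per page write
    w:          expected write demand w_i of the epoch, in bytes
    c:          capacity c_i of the epoch, in bytes
    t_epoch:    epoch length in seconds
    page_size:  flash page size in bytes
    """
    n = c / page_size  # number of pages allowed during the epoch
    if w > c:
        # Eq. (4): spread the extra time over the n page writes.
        delta = t_epoch * (w / c - 1) / n
        return prev_delay + delta
    if w < c:
        if w == 0:
            return 0.0  # assumption: no demand, so no throttling
        # Eq. (5) as read here; the delay never drops below zero.
        delta = t_epoch * (c / w - 1) / n
        return max(0.0, prev_delay - delta)
    return prev_delay  # w == c: keep the previous delay


if __name__ == "__main__":
    page = 4096
    cap = page * 1000            # hypothetical epoch capacity: 1000 pages
    d = update_throttling_delay(0.0, 2 * cap, cap, 1800, page)
    print(d)                     # demand is twice the capacity
    print(update_throttling_delay(d, cap // 2, cap, 1800, page))
```

With a 30-minute epoch and a demand twice the epoch capacity, the sketch adds 1.8 s of delay per page write, which shows how quickly the delay grows when an epoch is badly over-subscribed.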


Until now, we assumed that the number of P/E cycles 
is fixed to a certain number. The achievable P/E cycles, 
however, can be increased depending on the amount of 
the idle time between two consecutive P/E cycles in a 
certain block because of the self-recovery effect of mem- 
ory cells. In order to exploit this endurance improve- 
ment, the throttling delay estimator first estimates the 
number of achievable P/E cycles at the beginning of each 
epoch, using the damage and recovery model mentioned 
in Section 2. In this work, the number of achievable P/E 
cycles is estimated, using the average idle time of ev- 
ery block in the SSD. The idle time is actually some- 
what different among blocks. However, this difference 
is not significant because the wear-leveler of the SSD 
makes the P/E cycles of all available blocks evenly dis- 
tributed. Therefore, the average idle time can be used 
as a useful parameter to estimate the overall endurance 
improvement of the SSD. The estimator then calculates
the remaining capacity, $C_r$, based on the achievable P/E
cycles and distributes it to the remaining epochs. Since
the number of P/E cycles is increased due to the recovery
effect, the capacity, $c_i$, of each epoch is also
increased, allowing more data to be written to the SSD
with less throttling delay.

Figure 6: An example of epoch-capacity enforcement. A
solid line indicates the unused capacity forwarded to the 
next period and a dashed line represents the data delayed 
to the next period or epoch. 


3.4 Enforcement of Epoch Capacity 


Once a throttling delay is decided, we throttle SSD per- 
formance by distributing throttling delays across every 
write as evenly as possible. This regulation policy is ben- 
eficial in minimizing response time variations, but it can- 
not guarantee the required lifetime if write demand pre- 
diction fails and unexpectedly high write traffic comes 
from the host. To resolve this problem, it is necessary 
to adopt an epoch-capacity enforcement policy, which 
prevents more data than the epoch capacity from being 
written to the SSD. 


One of the easiest ways to enforce the epoch capac- 
ity is to stop writing if the epoch capacity is likely to be 
exhausted before the epoch ends. We call such a reg- 
ulation strategy the pessimistic epoch-capacity enforce- 
ment policy. The pessimistic policy divides one epoch 
into periods whose lengths are 1 second each and then 
distributes the capacity of an epoch to all periods evenly. 
If more data than the period capacity are requested, the
epoch-capacity regulator stops writing so that the
overflowed requests are written in the next period.
If there is unused capacity in the current period, the
regulator forwards it to the next period.
This period-based capacity regulation allows us to maintain
a minimum write throughput when there is unexpectedly high
write traffic. If we instead stopped writing only after the
whole epoch capacity was exhausted, the SSD could not write
any data until the epoch ended, causing significant
performance degradation. Figure 6 compares
the situations where no epoch-capacity enforcement pol- 
icy is used and the pessimistic policy is used. Here, we 
assume that the epoch capacity is 4 MB and the number 
of periods is 4. As shown in Figure 6(a), the 4.2 MB data 
are written to the SSD without epoch-capacity enforce- 
ment. With pessimistic epoch-capacity enforcement, the 
maximum number of bytes written to the SSD is limited 
to 4.0 MB as shown in Figure 6(b). 
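
The period-based regulation described above can be sketched in Python as follows. This is a minimal model, not the simulator code: the function name is hypothetical, capacities are expressed in KB so that the arithmetic stays in integers, and the request sizes in the usage example are made-up values that merely add up to 4.2 MB.

```python
def pessimistic_enforce(requests, epoch_capacity):
    """Simulate pessimistic epoch-capacity enforcement for one epoch.

    requests:       bytes requested in each period of the epoch
    epoch_capacity: bytes allowed during the whole epoch
    Returns (written_per_period, backlog_left_for_next_epoch).
    """
    per_period = epoch_capacity // len(requests)  # even split over periods
    backlog = 0   # requests delayed from earlier periods
    carry = 0     # unused capacity forwarded from earlier periods
    written = []
    for req in requests:
        budget = per_period + carry
        pending = backlog + req
        w = min(pending, budget)   # never exceed the period budget
        written.append(w)
        backlog = pending - w      # overflow goes to the next period
        carry = budget - w         # leftover capacity is forwarded
    return written, backlog


if __name__ == "__main__":
    # 4 MB epoch, four periods, 4.2 MB requested in total (values in KB).
    print(pessimistic_enforce([1200, 800, 1200, 1000], 4000))
    # A burst of 4 MB in the first period is spread over all four periods.
    print(pessimistic_enforce([4000, 0, 0, 0], 4000))
```

In the first call the regulator admits exactly 1000 KB per period and leaves 200 KB for the next epoch, so the data written in the epoch never exceed the 4 MB capacity.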


The weakness of the pessimistic policy is that it does
not efficiently handle a bursty I/O pattern which writes a
large amount of data within a relatively short period.
Figure 7(a) shows how the pessimistic policy behaves under
a bursty write request. We assume that the capacity of an
epoch is 4 MB and the number of periods is 4. Consider
that 4 MB of data are requested during the period $p_0$
while no write requests are issued during the periods
$p_1$, $p_2$, and $p_3$. In this example, the pessimistic
policy throttles write requests for every period except for
$p_3$ because the requested data always exceed the maximum
capacity of the period. However, since the total number of
bytes written during the epoch is equal to 4 MB, throttling
for the periods $p_0$, $p_1$, and $p_2$ is, in fact,
unnecessary.

Figure 7: A comparison of the pessimistic and optimistic
epoch-capacity enforcement policies when the 4 MB data are
written during the period $p_0$. (a) With pessimistic
epoch-capacity enforcement. (b) With optimistic
epoch-capacity enforcement.

We resolve this overly restrictive throttling behavior
for bursty write requests by proposing the optimistic
epoch-capacity enforcement policy. Our optimistic policy
maintains a relatively small amount of spare capacity for
each epoch and forcibly throttles write performance only
when both the capacity of a period and the spare capacity
are exhausted. Figure 7(b) shows an example of the
optimistic policy with the same scenario shown in
Figure 7(a). Here, we assume that the spare capacity is set
to 4 MB. As shown in Figure 7(b), unnecessary throttling
can be completely avoided with the optimistic
epoch-capacity enforcement policy.
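
A minimal Python sketch of the optimistic policy follows. The function name is hypothetical, values are in KB, and the way the spare capacity is consumed (only for the part of a write that exceeds the period budget, with no repayment inside the epoch) is an assumption made for illustration; the text only states that throttling happens when both the period capacity and the spare capacity are exhausted.

```python
def optimistic_enforce(requests, epoch_capacity, spare_capacity):
    """Simulate optimistic epoch-capacity enforcement for one epoch.

    requests:       bytes requested in each period of the epoch
    epoch_capacity: bytes allowed during the whole epoch
    spare_capacity: extra bytes that may be borrowed during the epoch
    Returns (written_per_period, backlog_left_for_next_epoch).
    """
    per_period = epoch_capacity // len(requests)
    backlog = 0
    carry = 0
    spare_left = spare_capacity
    written = []
    for req in requests:
        budget = per_period + carry
        pending = backlog + req
        # Throttle only when period budget and spare are both used up.
        w = min(pending, budget + spare_left)
        spare_left -= max(0, w - budget)   # part taken from the spare
        carry = max(0, budget - w)         # unused period capacity
        backlog = pending - w
        written.append(w)
    return written, backlog


if __name__ == "__main__":
    # The burst of Figure 7: 4 MB in the first period, then idle.
    print(optimistic_enforce([4000, 0, 0, 0], 4000, 4000))
    # A larger burst with a small spare is still throttled.
    print(optimistic_enforce([6000, 0, 0, 0], 4000, 1000))
```

The first call admits the whole burst in the first period, while the second shows that the data written in an epoch stay bounded by the epoch capacity plus the spare capacity.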

The spare capacity must be carefully chosen. Suppose
that the spare capacity is unlimited and there is
unexpectedly high write traffic. In that case, READY
borrows as much capacity as possible from future epochs
without limitation and then uses it up. If unexpected write
demands frequently occur and write demands are gradually
increasing, the SSD is worn out before the required
lifetime. On the other hand, if the spare capacity is too
small, unnecessary throttling with a bursty I/O pattern
would be frequently observed due to the lack of spare
capacity. In this work, the spare capacity is empirically
set to 10% of the remaining capacity, $C_r$, of the SSD.
This capacity is sufficient to avoid unnecessary throttling
in real-world traces. Furthermore, since the spare capacity
is limited to 10% of $C_r$, the SSD never wears out before
the target lifetime.

Figure 8: Reconstruction of write demand distribution.

Suppose that the spare capacity is 10% and there are
$n$ epochs. The capacity of each epoch is
$c_0, \ldots, c_{n-1}$, respectively. Note that
$c_0 = \ldots = c_{n-1} = C_r / n$ as mentioned in
Section 3.3. The spare capacity for the 0-th epoch is
$(c_1 + \ldots + c_{n-1}) \cdot 0.1$, and thus the total
capacity that can be written during the 0-th epoch is
$c_0 + (c_1 + \ldots + c_{n-1}) \cdot 0.1$. If $n$ is 3 and
$C_r$ is 3 MB, $c_0$ is 1 MB and the spare capacity is
0.2 MB. If less than $c_0$ of data have been written during
the 0-th epoch, the remaining capacity, $C_r$, of the SSD
after the 0-th epoch is equally distributed to the
remaining epochs and then the spare capacity is determined
by $(c_2 + \ldots + c_{n-1}) \cdot 0.1$ for the 1-st epoch.
If the spare capacity, however, is partially used during
the 0-th epoch, then $c_1, \ldots, c_{n-1}$ are reduced to
90% of their original capacities and only the unallocated
capacity is used as the spare capacity. For example, in the
above example, if 1.1 MB of data have been written during
the 0-th epoch, $c_1$ and $c_2$ are 0.9 MB and the spare
capacity becomes 0.1 MB in the 1-st epoch. This capacity
assignment policy makes the throttling delay estimator
slightly increase a throttling delay with a smaller epoch
capacity. The overused capacity is accordingly reclaimed
during the remaining epochs. If the spare capacity is used
up during the 0-th epoch, the pessimistic policy is used
with the reduced epoch capacity, i.e., 0.9 MB, and no spare
capacity. This means that performance degradation caused by
the depletion of the spare capacity is 10% in the worst
case.
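
The capacity assignment above can be checked with a small Python sketch. It is an illustration only: the function names are hypothetical, values are in KB, and the sketch covers a single transition from one epoch to the next as described in the text rather than a full multi-epoch bookkeeping scheme.

```python
def initial_spare(c_r, n, spare_ratio=0.1):
    """Spare capacity of the current epoch: 10% of the other epochs' capacity."""
    return spare_ratio * (c_r / n) * (n - 1)


def next_epoch_plan(c_r, n, written, spare_ratio=0.1):
    """Plan the epochs that follow the current one.

    c_r:     remaining SSD capacity before the current epoch
    n:       number of remaining epochs, including the current one
    written: bytes actually written during the current epoch
    Returns (capacity_of_each_following_epoch, spare_for_next_epoch).
    """
    c = c_r / n              # nominal capacity of every epoch
    left = c_r - written     # remaining capacity after this epoch
    rest = n - 1             # epochs still to come
    if written <= c:
        # No borrowing: redistribute evenly and derive a fresh spare.
        c_next = left / rest
        spare = spare_ratio * c_next * (rest - 1)
    else:
        # Spare was used: shrink epochs to 90%; the rest is the new spare.
        c_next = (1 - spare_ratio) * c
        spare = max(0.0, left - c_next * rest)
    return c_next, spare


if __name__ == "__main__":
    print(initial_spare(3000, 3))          # spare of the 0-th epoch
    print(next_epoch_plan(3000, 3, 1100))  # 1.1 MB written in epoch 0
    print(next_epoch_plan(3000, 3, 1200))  # spare fully used in epoch 0
```

For the 3 MB, three-epoch example, the sketch reproduces the figures in the text: a 0.2 MB spare for the 0-th epoch, then 0.9 MB epochs with a 0.1 MB spare after 1.1 MB have been written, and no spare at all after 1.2 MB.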


3.5 Epoch Length Selection 


The length of an epoch must be carefully decided. If 
the epoch length is chosen improperly, the difference in 
write demands between epochs becomes large. Since a 
throttling delay is determined by the write demand of the 
previous epoch, the incorrect epoch length can make a 
large fluctuation in the overall I/O response time of the 
SSD. To determine the proper epoch length, we monitor 
write demands of a workload and find repeated cycles 
that show similar write demands. We then choose that 
cycle as the epoch length. 


Figure 9: An overall procedure of epoch length selection.


To realize this in READY, we collect information
about write demands, i.e., the number of bytes written
per unit-time window, at runtime. The write demands
collected here, however, include throttling delays that
distort the actual write demands of applications.
Therefore, it is necessary to reconstruct the write demand
distribution that would be observed if throttling were not
applied. We estimate these original write demands by
rebuilding unit-time windows without throttling delays as
shown in Figure 8.

To find the proper epoch length, we use a simple
approach that attempts to find the best epoch candidate,
which exhibits the smallest fluctuation in write demands,
by creating and evaluating several candidate epochs with
different lengths. Figure 9 shows our approach in choosing
an epoch length. We first create a candidate epoch
whose length, $k$, is one unit-time window, i.e., $k = 1$.
We then calculate the write-demand difference ratio of two
consecutive epochs $i$ and $i+1$ with the same length. The
write-demand difference ratio, $r^{k}_{(i,i+1)}$, is
defined as follows:

$$r^{k}_{(i,i+1)} = \frac{d^{k}_{i+1} - d^{k}_{i}}{d^{k}_{i}} \qquad (6)$$

where $d^{k}_{i}$ and $d^{k}_{i+1}$ are the number of bytes
written during the epochs $i$ and $i+1$, respectively. For
example, in the example of Figure 9, $d^{1}_{0}$ and
$d^{1}_{1}$ are 3 MB and 6 MB, respectively, and thus
$r^{1}_{(0,1)} = 1.0$. We calculate the average
write-demand difference ratio, $\mu^{k}$, for all available
pairs of two consecutive epochs. In the example of
Figure 9, $\mu^{1}$ for $r^{1}_{(0,1)}$ through
$r^{1}_{(2,3)}$ becomes 1.0.
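
The candidate evaluation can be sketched in Python as shown below, together with a simple search over candidate lengths that stops once fewer than two candidate epochs of the same length remain. The function names are hypothetical. Taking the magnitude of the ratio, averaging over every adjacent pair of epochs, and skipping pairs whose first epoch wrote nothing are assumptions of this sketch, so its intermediate averages need not match the values quoted for Figure 9.

```python
def average_difference_ratio(window_demands, k):
    """Average write-demand difference ratio for candidate length k.

    window_demands: bytes written in each unit-time window
    k:              candidate epoch length, in unit-time windows
    Returns None when no ratio can be computed.
    """
    usable = len(window_demands) // k  # only complete candidate epochs
    epochs = [sum(window_demands[j * k:(j + 1) * k]) for j in range(usable)]
    ratios = [abs(b - a) / a for a, b in zip(epochs, epochs[1:]) if a > 0]
    return sum(ratios) / len(ratios) if ratios else None


def select_epoch_length(window_demands):
    """Return (best_k, best_average_ratio) over all candidate lengths."""
    best_k, best_mu = None, None
    k = 1
    while len(window_demands) // k >= 2:  # need at least two epochs
        mu = average_difference_ratio(window_demands, k)
        if mu is not None and (best_mu is None or mu < best_mu):
            best_k, best_mu = k, mu
        k += 1
    return best_k, best_mu


if __name__ == "__main__":
    # Four unit-time windows alternating between 3 MB and 6 MB.
    print(select_epoch_length([3, 6, 3, 6]))
```

For windows that alternate between 3 MB and 6 MB, epochs of two windows each write 9 MB, so the sketch selects a length of two unit-time windows.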

We then increase the length of a candidate epoch by
one unit-time window and calculate $\mu^{k+1}$. We repeat
this until the number of epochs with the same length
becomes one. Finally, the length of a candidate epoch whose
average write-demand difference ratio is the smallest is
chosen as the epoch length, $t_{epoch}$. For example, in
Figure 9, $\mu^{k}$ is the smallest when $k$ is 2, and thus
the new epoch length becomes the length of two unit-time
windows. After choosing the new epoch length, READY
recalculates a throttling delay using Eq. (4) if dynamic
throttling is necessary. The new epoch length is determined
under the assumption that there are no throttling delays.
The epoch length, $t_{epoch}$, is thus increased to
$t_{epoch} \cdot (w_i / c_i)$ to include delays caused by
throttling.

Finding the epoch length may take a relatively long 
time. To mitigate the computational overhead caused by 
epoch length selection, the epoch length is recalculated 
when the write-demand difference ratio between the pre- 
dicted write demand and the actual one is higher than 
0.25 and it occurs three times successively. The length 
of a unit-time window is also set to 10 minutes to fur- 
ther reduce the computational overhead. In this work, 
0.25 is chosen empirically by considering both compu- 
tational overhead and the accuracy of write demand pre- 
diction. However, this number can be further optimized 
in several ways. For example, the write-demand differ- 
ence ratio that triggers epoch length recalculation can be 
adaptively changed depending on the characteristics of a 
workload. If the difference ratio is always smaller than
0.25, we can reduce this number, e.g., to 0.15, to find a
better epoch length. On the other hand, if the difference
ratio is much larger than 0.25 all the time, it may be
better to increase this number, e.g., to 0.35, to avoid
useless computational overhead.
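
The recalculation trigger described above can be sketched as a small Python class. The class and method names are hypothetical, and computing the ratio relative to the predicted demand, as well as resetting the counter once the trigger fires, are assumptions of this sketch.

```python
class RecalcTrigger:
    """Fire when the prediction error exceeds a threshold several times in a row."""

    def __init__(self, threshold=0.25, needed=3):
        self.threshold = threshold  # write-demand difference ratio limit
        self.needed = needed        # consecutive violations required
        self.streak = 0

    def observe(self, predicted, actual):
        """Record one epoch; return True if the epoch length should be recalculated."""
        if predicted > 0:
            exceeded = abs(actual - predicted) / predicted > self.threshold
        else:
            exceeded = actual > 0   # assumption: any write after a zero prediction counts
        self.streak = self.streak + 1 if exceeded else 0
        if self.streak >= self.needed:
            self.streak = 0         # start counting again after firing
            return True
        return False


if __name__ == "__main__":
    trigger = RecalcTrigger()
    for predicted, actual in [(100, 140), (100, 60), (100, 130), (100, 110)]:
        print(trigger.observe(predicted, actual))
```

Making `threshold` a constructor argument reflects the remark above that the value 0.25 could be adapted to the workload, for example lowered to 0.15 or raised to 0.35.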


4 Experimental Results 


In this section, we first describe our experimental settings 
and explain enterprise benchmarks used for the evalua- 
tions in detail. We then analyze the benefit of the pro- 
posed READY technique over the static throttling tech- 
nique in terms of SSD lifetime, response time, and re- 
sponse time variations. 


4.1 Experimental Settings 


To evaluate the effectiveness of the proposed READY 
technique, we have performed our evaluations using the 
DiskSim-based SSD simulator [23]. The flash memory 
used for the evaluations was based on 2-bit MLC NAND 
flash memory, and each block was composed of 64 4 KB 
pages. The page read time and the page write time were 
50 µs and 600 µs, respectively, and the block erasure
time was 2 ms. The number of P/E cycles allowed to a 
block was initially set to 3K, but it was changed depend- 
ing on the length of the idle time based on our recovery 
model. The target lifetime of the SSD was set to 5 years. 

We have implemented the static and dynamic throt- 
tling techniques in the SSD simulator, along with the 
damage and recovery model described in Section 2. The 
throttling module was implemented between the host in- 
terface, e.g., SATA, and the flash translation layer (FTL). 
The throttling module intercepted write requests destined 
for the FTL and then applied a throttling delay if it was 
required for the lifetime guarantee. The FTL employed
a page-level address mapping algorithm with a greedy
garbage collection policy and used a hot-cold swapping
algorithm for wear-leveling [23]. Note that there were no
changes at the FTL level for throttling because the
throttling module has been designed to operate
independently of the underlying FTL algorithms.

Trace      Duration   Data written per hour (GB)   SSD capacity (GB)
proxy      1 week     [illegible]                  32
proj       1 week     [illegible]                  32
exchange   1 day      [illegible]                  128
map        1 day      [illegible]                  128
msnfs      6 hours    [illegible]                  128

Table 1: A summary of traces used for evaluations.

We compared the performance of five SSD configurations:
NT, ST, DT, READY_PES, and READY_OPT. NT does not use write
throttling, so it cannot guarantee the target SSD lifetime
if write traffic is very intensive. ST and DT use static
throttling and dynamic throttling, respectively. Note that
DT uses the optimistic epoch-capacity enforcement policy by
default. Both READY_PES and READY_OPT are different from
the other configurations in that they take into account the
self-recovery effect of memory cells. READY_PES uses the
pessimistic epoch-capacity enforcement policy, whereas
READY_OPT employs the optimistic policy.


4.2 Benchmarks 


We have chosen two enterprise traces, proxy and
proj, from the MSR-Cambridge benchmark [21] and
have used three production traces, exchange, map,
and msnfs, from the MS-Production benchmark [22].
Table 1 summarizes the traces used for our evaluations.
proxy and proj were recorded for one week.
exchange and map contain 24 hours of I/O activity,
while msnfs was collected for 6 hours. Because of the
limited duration of the traces, it was impossible to assess 
the lifetime guarantee of 5 years with them. For this rea- 
son, we performed our evaluations under the assumption 
that the same I/O pattern is repeated for 5 years. 

The write demand is very different depending on the
traces. proxy and proj exhibit a low write demand
in comparison with exchange, map, and msnfs. The
write amplification factor (WAF), which has a great effect
on the write demand, ranges from 1.62 to 2.26 according to
the characteristics of I/O references [24]. For the
evaluations, the SSD capacity was configured differently
depending on the traces so that the SSD lifetime becomes a
concern. For proxy and proj with a low write demand, the
SSD capacity was set to 32 GB. For exchange, map, and msnfs
with a high write demand, the capacity of the SSD was set
to 128 GB.

In practice, this capacity planning is carefully decided
according to customers' requirements. If customers are
ready to pay to obtain a long lifetime and high
performance, an over-provisioned configuration with a
larger capacity SSD is the best choice. If customers
require a low initial cost, but can manage a relatively
high operational cost, a smaller capacity SSD with a
shorter target lifetime, i.e., 3 years, is preferred. For
customers who want reasonable performance with relatively
lower cost and a longer target lifetime, e.g., 5 years, the
settings shown in Table 1 may be a better choice.

Figure 10: A comparison of effective SSD lifetimes for
five traces with different SSD configurations.

All the traces used in this work were collected from 
hard disk drives (HDDs) which exhibit much lower 
I/O performance than SSDs. Since SSDs increase the 
overall I/O rate of the storage subsystem by several 
times [25, 26], the number of bytes written to a storage 
device during the same time period will be largely in- 
creased in comparison with HDDs. That is, ‘data written 
per hour (GB)’ shown in Table 1 becomes larger, and 
thus READY more aggressively throttles write perfor- 
mance because of increased write traffic. Therefore, the 
SSD capacity in Table 1 is set relatively conservatively
for systems that use SSDs as a secondary storage device.


4.3 Lifetime Analysis


We first analyze the lifetime of the SSD for the five
traces. Figure 10 shows the effective lifetime of the
SSD with different SSD configurations. Here, the effective
lifetime is the lifetime which is estimated based on
the assumption that the I/O activities of the traces are
repeated for 5 years. Note that the self-recovery effect
of memory cells is taken into account in estimating the
SSD lifetime. As shown in Figure 10, NT cannot guarantee
the required lifetime of the SSD for any of the traces
except proj. In our observation, the write demand
of proj is not intensive, and thus the SSD can achieve
a lifetime of more than 5 years without write throttling.
ST and DT do not consider the self-recovery effect of
floating-gate transistors, and therefore they throttle
write performance based on the fixed 3K P/E cycles. Since
the P/E cycles of the SSD are increased due to the effect
of self-recovery, the effective SSD lifetimes with ST and
DT are much longer than the required lifetime. This means
that ST and DT excessively throttle write performance,
underutilizing available P/E cycles of the SSD. This
excessive throttling results in poor write performance in
comparison with READY_PES and READY_OPT. In particular, DT
dynamically decides a throttling delay in response to a
changing workload. Therefore, DT maximizes the utilization
of P/E cycles within 3K unlike ST. We will discuss this
issue in detail with Table 2.

Trace      Configuration   W_work
proj       NT              144.4
proj       ST              54.2
proj       DT              93.7
proj       READY_PES       141.0
proj       READY_OPT       141.0
exchange   NT              1918.8
exchange   ST              348.2
exchange   DT              374.4
exchange   READY_PES       1077.2
exchange   READY_OPT       1065.6

Table 2: The amount of data written for 5 years for two
traces, proj and exchange.


READY_PES takes advantage of the self-recovery effect
of memory cells. Therefore, it throttles write performance
so that the effective lifetime of the SSD is close
to the required lifetime for all the traces. Figure 10 also
shows that READY_OPT guarantees the required SSD lifetime
even though it uses the capacity borrowed from future
epochs in advance. This clearly shows that the optimistic
epoch-capacity enforcement policy properly manages overused
epoch capacity so that the given lifetime is satisfied.


Table 2 analyzes the lifetime of the SSD from the
perspective of written data for two traces, proj and
exchange. As mentioned in Section 2, $C_{ssd}$ represents
the number of bytes that can be written to the
SSD according to the NAND flash memory specification,
whereas $C^{*}_{ssd}$ is the total number of writable bytes
when the recovery effect is taken into account. $W_{work}$
is the total number of bytes written to the SSD for 5
years.


As expected, ST and DT throttle write performance so
that $W_{work}$ becomes close to $C_{ssd}$. In particular,
in the case of ST, $W_{work}$ is about 43% and 8% smaller
than $C_{ssd}$ for proj and exchange, respectively. This is
because ST excessively throttles write performance,
assuming that write requests are always intensive. Unlike
ST, DT dynamically changes a throttling delay according to
the write demands of a workload and the remaining lifetime
so that $W_{work}$ is close to $C_{ssd}$, allowing more
data to be written to the SSD.


READY_PES and READY_OPT fully utilize the endurance
improvement offered by the self-recovery effect,
making $W_{work}$ close to $C^{*}_{ssd}$ at the target SSD
lifetime. In the case of proj, since the endurance of the
SSD is sufficient to guarantee the required 5-year
lifetime, throttling is not performed in most cases.

Figure 11: A comparison of average write response times
for five traces with different SSD configurations.

Figure 12: Cumulative distribution functions (CDFs) of write response times.


4.4 Performance Analysis 


To evaluate the performance benefit of the proposed
READY technique, we measured the average response
time of a page write while running five traces with
different SSD configurations. Figure 11 shows our
evaluation results. As expected, NT exhibits the best I/O
response time among all of the evaluated configurations,
but it cannot guarantee the target lifetime as shown in
Figure 10 because it does not throttle write performance.
The average write response time of NT is close to the
page write time, i.e., 600 µs, with little variation.
Both READY_PES and READY_OPT throttle write requests to
meet the required lifetime, so their performance is worse
than that of NT; they exhibit 1.0x to 2.13x higher write
response time than NT. In the case of proj, READY_PES and
READY_OPT do not reduce write performance because the
required lifetime can be satisfied without throttling.
Therefore, little performance degradation, which is less
than 1.9%, is observed in proj. READY_PES and READY_OPT
achieve 2.57x better performance than DT on average. This
performance benefit mainly comes from the increased P/E
cycles of the SSD. Since READY_PES and READY_OPT are aware
of the improvement in SSD endurance, they can assign more
capacity to epochs, reducing throttling delays.

DT exhibits a 1.7x faster response time than ST on
average. DT determines the epoch capacity periodically
based on the remaining lifetime of the SSD and changes 
a throttling delay so that write requests are properly de- 
layed in response to future write demands. This epoch 
capacity assignment and throttling delay distribution pol- 
icy allows us to fully utilize the available endurance of 
the SSD. On the other hand, ST neither considers the re- 
maining lifetime of the SSD nor the characteristic of a 
workload in making a throttling decision. Instead, ST 
simply throttles write performance by limiting the maxi- 
mum bandwidth of the SSD. Therefore, ST causes many 
unnecessary throttling delays. 

The response time variation is one of the important
design issues that must be taken into account in designing
throttling algorithms. We compared response time
variations between different SSD configurations.
Figure 12 shows the cumulative distribution functions
(CDFs) of write response times for five traces. As shown in
Figure 12, ST shows significant response time variations
for all the traces because it forcibly stops writing if
throttling is needed. On the other hand, by distributing
throttling delays to write requests as evenly as possible,
DT, READY_PES, and READY_OPT greatly reduce variations
in the write response time.

For exchange, msnfs, and map, READY_PES incurs
relatively high I/O response time variations in comparison
with READY_OPT. READY_PES must stop writing data
when there are a large number of writes within a short
period. On the other hand, READY_OPT uses the optimistic
epoch-capacity enforcement policy, so it handles
a bursty I/O pattern more efficiently without compulsory
write throttling.

Table 3: Accuracy of write demand prediction.

The write response time of DT ranges from 600 μsec 
to several thousand seconds in map and proj unlike 
proxy, exchange, and msnfs. The write patterns 
of map and proj change greatly with time, and thus 
the difference in write demands between two consecu- 
tive epochs is relatively large. Since a throttling delay 
for a certain epoch is determined by the write demand 
of the previous epoch, the difference between throttling 
delays is accordingly increased in map and proj. Nev- 
ertheless, the response time of DT is more stable than 
ST. 

We evaluated the accuracy of our epoch length selec- 
tion method in predicting future write demands. Table 3 
shows our evaluation results for five traces. We assume 
that epoch length detection is accurate if the difference
between the predicted write demand and the actual one
is smaller than 25%. As shown in Table 3, our method
achieves high accuracy for proxy, exchange, and
msnfs. The accuracy of write demand prediction, however,
is reduced to 50.9% and 33.9% for map and proj,
respectively, due to a high fluctuation in write requests.
We expect that the accuracy of epoch length detection 
may be improved with traces recorded for a longer time. 
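
The accuracy criterion above can be written down as a short Python sketch. The 25% threshold comes from the text; measuring the difference relative to the actual demand, the treatment of zero-demand epochs, and the function name are assumptions made for illustration.

```python
def prediction_accuracy(predicted, actual, tolerance=0.25):
    """Fraction of epochs whose predicted write demand is within `tolerance`
    of the actual demand (relative to the actual demand; an assumption)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one epoch is required")
    hits = 0
    for p, a in zip(predicted, actual):
        if a == 0:
            accurate = (p == 0)  # assumption: only an exact zero counts
        else:
            accurate = abs(p - a) / a < tolerance
        hits += accurate
    return hits / len(predicted)


# Hypothetical per-epoch write demands (arbitrary units).
acc = prediction_accuracy([100, 100, 100, 100], [100, 120, 200, 90])
```

Here three of the four hypothetical epochs fall within the 25% band, so the sketch reports an accuracy of 0.75; a trace whose demand fluctuates strongly between epochs would miss the band more often.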

To evaluate the effect of the epoch length selection
method on the SSD response time, we compared the
changes in throttling delays when a fixed epoch length
is used and when the epoch length is dynamically determined
according to the workload. For the evaluation, we executed
the exchange trace, which is a 24-hour trace, repeatedly.
Figure 13 shows our evaluation result. In this figure,
FIXED represents READY_OPT with the fixed epoch
length, and DYNAMIC is the SSD configuration in which
READY_OPT uses the proposed epoch length detection
method. The fixed epoch length was set to 10 minutes.
As shown in Figure 13, even though READY_OPT generally
works well with exchange, some variations in
throttling delays are observed with FIXED. DYNAMIC
also exhibits variations in response times at the beginning
of the execution, but it becomes stable after repeated
write demands are detected, as shown in Figure 13(b).


4.5 Detailed Analysis 


Figure 13: A comparison of throttling delays when the
fixed epoch length is used and the epoch length is
dynamically determined in the exchange trace.


We performed a detailed analysis of different SSD
configurations. Figure 14 shows the throughput of the
SSD with the different throttling schemes when intense
I/Os are being served. As mentioned before, ST limits
the maximum bandwidth of the SSD to a certain level,
2.49 MB/s in map. The overall throughput of the SSD is
thus greatly degraded with ST, as shown in Figure 14(a).
DT works better than ST. Due to the limited write
endurance of the SSD, however, significant performance
degradation cannot be avoided with DT, as depicted in
Figure 14(b). READY_PES and READY_OPT exhibit much
higher performance than ST and DT by exploiting the
improved write endurance of the SSD that results from
the self-recovery effect of memory cells. In particular,
READY_OPT performs better than READY_PES when a large
amount of data is being written to the SSD, e.g., during
the period from 200 to 350 seconds in Figure 14(d). Even
when write requests are intensively issued, READY_OPT
writes the requested data to the SSD by using the spare
capacity borrowed from future epochs, rather than
forcibly throttling the bandwidth of the SSD. This allows
READY_OPT to exhibit better write response times for
traces like map, which exhibit a great fluctuation in
write requests.
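
The difference between the pessimistic and optimistic enforcement policies can be illustrated with a toy Python model. It is a sketch under stated assumptions rather than the READY implementation: the per-epoch capacity is constant, borrowed capacity is repaid out of the next epochs before new writes are admitted, stalled bytes are only counted (not re-queued), and the `max_borrow` limit and all names are hypothetical.

```python
def stalled_bytes(demands, capacity, optimistic, max_borrow):
    """Bytes that must be held back in each epoch.

    demands    : write demand per epoch
    capacity   : write budget granted to every epoch
    optimistic : if True, excess writes may borrow from future epochs
    max_borrow : upper bound on outstanding borrowed capacity
    """
    stalled = []
    debt = 0.0  # capacity already borrowed from future epochs
    for demand in demands:
        # Repay earlier borrowing out of this epoch's budget first.
        repay = min(debt, capacity)
        debt -= repay
        available = capacity - repay
        excess = max(demand - available, 0.0)
        if optimistic:
            borrow = min(excess, max_borrow - debt)
            debt += borrow
            excess -= borrow
        stalled.append(excess)
    return stalled


# Hypothetical bursty workload: one epoch far above the per-epoch budget.
burst = [50, 300, 20, 30]
pes = stalled_bytes(burst, capacity=100, optimistic=False, max_borrow=200)
opt = stalled_bytes(burst, capacity=100, optimistic=True, max_borrow=200)
```

With this input the pessimistic policy holds back 200 units during the burst, while the optimistic policy serves the burst immediately and repays the borrowed capacity in the quieter epochs that follow.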


5 Related Work 


There have been a lot of studies on improving the en- 
durance of flash-based SSDs. Many existing garbage 
collection and wear-leveling techniques [27, 28, 29, 30, 
31, 32, 33, 34] are designed to improve the lifetime of 
SSDs by avoiding useless data migration during a block 
recycling process or by distributing P/E cycles of flash 
blocks as evenly as possible. 

Figure 14: A detailed analysis of four SSD configurations
with the map trace.


As the endurance of flash memory continuously
deteriorates, several endurance enhancement techniques that
aggressively reduce the amount of data written to SSDs
have been proposed. Data de-duplication [26, 35, 36]
and data compression [37, 38] are representative examples
of these policies. Data de-duplication detects duplicate
data blocks that already exist in a storage device
and then eliminates redundant writes to SSDs for such
blocks. Data compression eliminates repeated bit patterns
within a data block, reducing writes to SSDs. These
techniques are useful in improving the lifetime of SSDs,
but they are limited in that none of them guarantees
the SSD lifetime or makes use of the recovery effect
of a memory cell.


More recently, the approaches that exploit the recov- 
ery effect of flash devices have received considerable at- 
tention. This paper is an improved version of our pre- 
liminary work [39]. Mohan et al. investigated the ef- 
fect of the damage recovery on the lifetime of SSDs for 
enterprise servers [1]. They claimed that the endurance 
of NAND flash memory was durable enough even for 
I/O intensive enterprise applications because of its recov- 
ery ability. However, their investigations were limited 
to 90 nm SLC and MLC flash memories which exhibit 
good endurance properties. They also did not exploit the
benefit of the recovery effect in ensuring the lifetime of
SSDs. Wu et al. presented an endurance enhancement
technique that boosts recovery speed by heating a worn-out
flash chip to a high temperature [10]. By leveraging
the temperature-accelerated recovery, it improved the
endurance of SSDs by up to five times. However, one of the
major drawbacks of this approach is that it requires extra
energy to heat flash chips, lowering the
energy efficiency of a storage device. Unlike Wu's work,
our study considers the endurance improvement of SSDs
at room temperature and exploits this benefit to guarantee
the lifetime of SSDs with less throttling overhead.


6 Conclusions 


In this paper, we proposed a recovery-aware dynamic 
throttling technique, READY, to overcome two main 
problems in realizing the adoption of SSDs in enterprise 
server systems: the continuously decreasing endurance 
and unpredictable lifetime problems. READY throttles
write performance so that the required SSD lifetime
is satisfied. To guarantee the SSD lifetime
with less throttling overhead, READY exploits the
recovery effect of floating-gate transistors, which
increases the number of effective P/E cycles of
SSDs. Our evaluation results showed that the proposed
throttling technique guarantees a lifetime warranty, while 
achieving a relatively small reduction in write response 
time and little response time variation over the static 
throttling technique. 

READY can be improved in several directions. The 
stress and recovery model of this work is based on the 
previous studies on the physical characteristics of flash 
memory [1, 2, 10]. These studies carefully modeled 
the stress and recovery characteristics of flash memory, 
but their scope was limited to NOR or NAND flash
memory fabricated in 90 nm or larger technology. To build
a more accurate stress and recovery model, we will
perform investigations using real NAND flash parts
that are fabricated in sub-30 nm technology. We
also plan to implement READY in a real SSD platform 
to evaluate its effectiveness in real systems. 